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Abstract 


)  This  paper  addresses  the  nonlinear  least-squares  prob)emjnHntfc-y«~ff/(:  r^-whenr 
'  '  /(sb)4s  »  vector  in  Sm  whose  components  are  smooth  nonlinear  functions.  The  problem 

h k,  arises  most  often  in  data  fitting  applications.  Much  research  has  focused  on  the  devel¬ 
opment  of  specialized  algorithms  that  attempt  to  exploit  the  structure  of  the  nonlinear 
least-squares  objectiveyt^fsurvejtf  numerical  methods  developed  for  problems  in  which 
sparsity  in  the  derivatives  of  /  is  not  taken  into  account  in  formulating  algorithms.  ^ 
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1.  Introduction 


This  paper  addresses  the  problem  of  minimizing  the  1%  norm  of  a  multivariate  func¬ 
tion  : 

min  ~  ||/(x)||2, 

where  f(x)  is  a  vector  in  3im  whose  components  are  real-valued  nonlinear  functions  with 
continuous  second  partial  derivatives.  We  shall  refer  to  the  function  1 1|/(*)||2  as 
nonlinear  least-squares  objective  function.  An  alternative  formulation  of  the  problem 
is  that  of  minimizing  a  sum  of  squares : 

1  m 
t  =  l 

where  each  <pi  is  a  real-valued  function  having  continuous  second  partial  derivatives. 

There  is  considerable  interest  in  the  nonlinear  least-squares  problem,  because  it 
arises  in  virtually  all  areas  of  quantitative  research  in  data-fitting  applications.  A  typical 
instance  is  the  choice  of  parameters  (3  within  a  nonlinear  model  ip  so  that  the  model 
agrees  with  measured  quantities  d,  as  closely  as  possible : 

•m  1 
i=l 

where  r,  are  prescibed  values.  Much  research  has  focused  on  the  development  of  spe¬ 
cialized  algorithms  that  attempt  to  exploit  the  structure  of  the  nonlinear  least-squares 
objective.  Despite  these  efForts,  methods  do  not  perform  equally  well  on  all  problems, 
and  it  is  generally  not  possible  to  characterize  those  problems  on  which  a  particular 
method  will  or  will  not  work  well. 

In  this  paper,  we  survey  existing  numerical  methods  for  dense  nonlinear  least-squares 
problems.  For  a  study  of  the  performance  of  widely-distributed  software  for  nonlinear 
least-squares,  see  Fraley  [1988b].  We  assume  a  knowledge  of  numerical  methods  for 
linear  least-squares  problems  (e.  g.,  Lawson  and  Hanson  [1974],  and  Golub  and  Van 
Loan  [1983]).  We  also  assume  familiarity  with  Newton-based  linesearch  and  trust- 
region  methods  for  unconstrained  minimization  (e.  g.,  Fletcher  [1930],  Gill,  Murray,  and 
Wright  [1981],  Dennis  and  Schnabel  [1983],  and  More  and  Sorensen  [1984]).  If  T  is  the 
function  to  be  minimized,  recall  that  both  linesearch  and  trust-region  methods  involve 
iterative  minimization  of  a  quadratic  local  model 

Q(p)  =  V; F(xk)Tp  +  i  pTHkp 
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for  P(xis  +  p)- f(xk),  the  change  in  T  at  the  current  iterate  x*.  In  linesearch  methods, 
the  vector  p£s  defined  by 

Pks  =  axg  min  p6«.  Q(p) 

is  used  as  a  search  direction.  A  positive  step  is  taken  from  x*  along  p£s  to  the  next 
iterate,  that  is, 

**+1  =  Xk  +  a/sPfcS) 

where  the  steplength  a*  >  0  is  computed  by  approximate  minimization  of  the  function 
$fe(a)  =  !F(ik  +  otPkS )•  The  vector  p£s  must  be  a  descent  direction  for  f  at  Xk  — 
in  other  words,  VT(xk)rPkS  <0  —  so  that  T  initially  decreases  along  p\s  from  x*. 
Normally  Hk  is  required  to  be  positive  definite,  which  guarantees  that  the  quadratic 
model  has  a  unique  minimum  that  is  a  descent  direction.  In  trust-region  methods, 

Xk+\  =  X  fc  +  Pfc*, 


where 

PkR  =  arg  min  p€S.  Q(p)  subject  to  ||p||  <  6k. 

The  rationale  for  restricting  the  size  of  p  in  the  subproblem  is  that  Q(p)  is  a  good 
approximation  to  T  only  at  points  close  to  x*. 

1.1  Definitions  and  Notation 

We  shall  use  the  following  definitions  and  notational  conventions : 

•  Generally  subscripts  on  a  function  mean  that  the  function  is  evaluated  at  the 
corresponding  subscripted  variable  (for  example,  /*  =  /(x*)).  An  exception  is 
made  for  the  residual  functions  fa,  where  the  subscript  is  the  component  index  for 
the  vector  /. 

•  /  -  The  vector  of  nonlinear  functions  whose  norm  is  to  be  minimized. 

The  nonlinear  least-squares  problem  is 

mjn  \  f(x)Tf(x), 

where  the  factor  |  is  introduced  in  order  to  avoid  a  factor  of  two  in  the  derivatives. 
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•  4>i  -  The  tth  residual  function,  also  the  tth  component  of  the  vector  /. 


/(*)  = 

An  alternative  formulation  of  the  nonlinear  (east-squares  problem  is 

1  m 

min  -  >  4>i(z)2, 
x€S»  2  H 
«=i 

where  each  <pi(x)  is  a  smooth  function  mapping  ft”  to  ft. 

•  J  -  The  m  x  n  Jacobian  matrix  of  /. 

J(x)  =  Vf(x)  = 


Mi. 

dx> 

•  •  •  9r» 

fi&a 

Mo. 

dx\ 

din 

•  g  -  The  gradient  of  the  nonlinear  least-squares  objective. 

$(*)  =  V  0/(*)T/(a:)j  =  /(i)T/(x) 

•  B  -  The  part  of  the  Hessian  matrix  of  the  nonlinear  least-squares  objective  that 
involves  second  derivatives  of  the  residual  functions.  We  have 

V2  ( \f(x)Tf(x ))  = 

where 

B(x)  = 

»'=  1 


•  72(A)  -  The  range  of  A. 

If  A  is  an  m  X  n  matrix,  then  72(A)  =  {6  €  ftm  |  Ax  =  6  for  some  x  G  ft"}  is  a 
subspace  of  ftm. 

•  Af{A)  -  The  null  space  of  A. 

If  A  is  an  m  X  n  matrix,  then  Af(A)  =  {x  6  ftn  t  Az  =  0}  is  a  subspace  of  ft”. 
Af{A)  is  the  orthogonal  complement  of  72(AT)  in  ft”. 


•••.'-.w.fwa 
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2.  Gauss-Newton  Methods 


The  classical  approach  to  nonlinear  least  squares,  called  the  Gauss-Newton  method, 
is  a  linesearch  method  in  which  the  search  direction  at  the  current  iterate  minimizes  the 
quadratic  function 

9kP+\pTJkJkP-  (2.1) 

The  function  (2.1)  is  a  local  approximation  to  1 1| f(xk  +p)|l2  -  2  Il/(x*)ll2  m  which 
each  residual  component  of  /  is  approximated  by  a  linear  function,  using  the  relationship 

f(xk+p)  =  f(xk)  +  J(xk)p+0(M2). 

As  a  model  for  the  change  in  the  least-squares  objective,  (2.1)  has  the  advantage  that 
it  involves  only  first  derivatives  of  the  residuals,  and  that  JT  J  is  always  at  least  positive 
semi-definite. 

The  Gauss-Newton  method  can  be  viewed  as  a  modification  of  Newton’s  method 
in  which  is  used  to  approximate  the  Hessian  matrix 

m 

+  =  jtj  +  b 

1=1 

of  the  nonlinear  least-squares  objective  function.  The  assumption  is  that  the  matrix 
JTJ  should  be  a  good  approximation  to  the  full  Hessian  when  the  residuals  are  small. 
In  fact,  if  /(**)  =  0  and  J(x*)TJ(x*)  is  positive  definite,  then  the  sequence  {**+?£*} 
is  locally  quadratically  convergent  to  z*,  because 

J(xk)TJ{xk)  =  \  V2|j/(x*)||2  +  0(||zfc  -  *11). 

For  more  convergence  results  and  detailed  convergence  analysis  for  the  Gauss-Newton 
method,  see,  e.  g.,  Chapter  10  of  Dennis  and  Schnabel  [1983],  Schaback  [1985],  and 
Haussler  [1986],  as  well  as  some  of  the  references  cited  below. 

McKeown  [1975a,  1975b]  studies  test  problems  of  the  form, 

(xT  Hi  x 

; 

xTHmx 

chosen  so  that  factors  affecting  the  rate  of  convergence  could  be  controlled.  He  uses 
three  such  problems,  with  seven  different  values  of  a  parameter  that  varies  an  asymptotic 
linear  convergence  factor.  The  algorithms  tested  include  some  quasi-Newton  methods 

4 


» 


for  unconstrained  optimization,  as  well  as  some  specialized  methods  for  nonlinear  least 
squares  that  have  since  been  superseded.  He  concludes  that,  when  the  asymptotic 
convergence  factor  is  small,  the  Gauss-Newton  method  is  more  efficient  than  the  quasi- 
Newton  methods,  but  that  the  opposite  is  true  when  the  asymptotic  convergence  factor 
is  large.  Fraley  [1967a;  1988b]  gives  numerical  results  for  some  Gauss- Newton  meth¬ 
ods  using  these  problems,  and  observes  that  the  Jacobian  is  well-conditioned  at  every 
iteration. 

A  difficulty  with  the  Gauss- Newton  method  arises  when  JTJ  is  singular,  or,  equiv¬ 
alently,  when  J  has  linearly  dependent  columns,  because  then  (2.1)  does  not  have  a 
unique  minimizer.  The  set  of  vectors  that  minimize  (2.1)  is  the  same  as  the  set  of 
solutions  to  the  linear  least-squares  problem 

min  \\Jkp  +  fk\ j2  .  (2.2) 

One  theoretically  well-defined  alternative  that  is  often  approximated  computationally  is 
to  require  the  unique  solution  of  minimum  / 2  norm : 

min||p||2,  (2.3) 

p€o 

where  S  is  the  set  of  solutions  to  (2.2).  Another  option  is  to  replace  J  in  (2.2)  by 
a  maximal  linearly  independent  subset  of  its  columns.  In  finite-precision  arithmetic, 
there  is  often  some  ambiguity  about  how  to  formulate  and  solve  an  alternative  to  (2.2) 
when  the  columns  of  J  are  "nearly”  linearly  dependent,  so  that,  from  a  computational 
standpoint,  any  particular  Gauss-Newton  method  must  be  viewed  as  a  class  of  methods. 
The  references  cited  above  for  linear  least  squares  discuss  at  length  the  difficulties 
inherent  in  computing  solutions  to  (2.2)  when  J  is  ill-conditioned,  and  show  that  the 
numerical  solution  of  these  problems  is  dependent  on  the  criteria  used  to  estimate  the 
rank  of  J.  For  a  survey  of  some  of  the  early  research  on  numerical  Gauss- Newton 
methods,  see  Dennis  [1977].  We  define  the  class  of  Gauss-newton  methods  to  include 
all  linesearch  methods  in  which  the  search  direction  is  the  result  of  some  well-defined 
computational  procedure  for  solving  (2.2). 

Most  often  in  Gauss- Newton  methods  the  nonlinear  least-squares  objective  is  used 
as  a  merit  function  for  the  linesearch .  If 

Pfc*  €  arg  min  p€K»gJp  +  ^  prJkJkp, 
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then  pkN  satisfies  the  equations 


JI  JkP=~9k,  (2.4) 

and  is  therefore  a  direction  of  descent  for  /T/  at  x*  whenever  gk  ^  0.  To  guaran¬ 
tee  convergence  to  a  local  minimum  in  a  linesearch  method,  the  sequence  of  search 
directions  must  also  be  bounded  away  from  orthogonality  to  the  gradient  of  the  merit 
function,  a  condition  that  may  not  be  met  by  successive  Gauss-Newton  directions  rela¬ 
tive  to  fTf  unless  the  eigenvalues  of  JTJ  are  bounded  away  from  zero.  Powell  [1970a] 
gives  an  example  of  convergence  of  a  Gauss- Newton  method  with  exact  linesearch  to  a 
non-stationary  point,  in  which  the  search  direction  becomes  orthogonal  to  a  non-zero 
gradient. 

Deuflhard  and  Apostolescu  [1980]  suggest  selecting  a  steplength  for  the  Gauss- 
Newton  direction  based  on  decreasing  the  merit  function  ||«/JJ/(zr)||2 ,  rather  than 
||/(x)||2,  for  a  class  of  nonlinear  least-squares  problems  that  includes  zero-residual  prob¬ 
lems.  The  function  j\  is  the  pseudo-inverse  of  Jk  (see,  e.  g.,  Chapter  6  of  Golub  and 
Van  Loan  [1983]),  and  j\fk  is  the  minmum  /2-norm  solution  to  ||J*:P  +  |[2 -  They 

reason  that  the  Gauss-Newton  direction  is  the  steepest-descent  direction  for  the  func¬ 
tion  m*/(z)]|2,  so  that  the  geometry  of  the  level  surfaces  defined  by  ||T^/(z)||2  is 
more  favorable  to  avoiding  small  steps  in  the  linesearch.  A  shortcoming  of  this  approach 
(pointed  out  by  the  authors)  is  that  there  are  no  global  convergence  results.  The  merit 
function  depends  on  x*.  so  that  a  different  function  is  being  reduced  at  each  step.  An¬ 
other  difficulty  is  that,  although  the  authors  state  that  numerical  experience  supports 
selection  of  a  steplength  based  on  ||</*/(a:)||2  for  ill-conditioned  problems,  the  transfor¬ 
mation  jl  is  not  numerically  well-defined  under  these  circumstances.  Therefore  neither 
the  Gauss-Newton  search  direction,  nor  the  merit  function,  is  numerically  well-defined 
when  the  columns  of  Jk  are  nearly  linearly  dependent.  Since  it  is  not  known  how  to 
improve  Gauss-Newton  methods  for  general  problem-  through  the  selection  of  merit 
functions  for  the  linesearch,  we  shall  henceforth  make  the  conventional  assumption  that 
the  linesearch  is  performed  relative  to  the  nonlinear  least  squares  objective. 

There  is  another  reason  why  it  is  difficult  to  say  precisely  what  is  meant  by  a 
"Gauss-Newton  method"  for  a  particular  nonlinear  least-squares  problem.  To  see  this, 
let  Q(x)  be  an  I  X  m  orthogonal  matrix  function  on  3Rn,  that  is,  Q(x)rQ(x)  =  I  for 
all  x.  Then  ||Q(x)/(x)|j2  =  l|/(ir)ll2  f°r  all  x,  and  consequently  the  function  /  =  Qf 
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defines  the  same  nonlinear  least-squares  problem  as  /.  The  Jacobian  matrix  of  /  is 
J  =  QJ  -f  (VQ)/,  so  that  a  minimizer  of  ||  Jp  +  f\\2  will  ordinarily  be  different  from  a 
minimizer  of  \\Jp  +  / 1|2,  unless  Q(x )  happens  to  be  a  constant  transformation.  However, 
if  both  Q  and  /  have  k  continuous  derivatives,  then  V*||$(x)/(x)||j  =  V'||/(x)j|2  for 
i  =  1,2, ...  ,k.  Letting  W  =  (VQ)f,  so  that  j  =  QJ  +  W,  we  have 

PJ  =  JTJ  +  (JTQrW  +  WtQ  J)  +  WTW , 

showing  that  the  Gauss- Newton  approximation  JT  J  to  the  full  Hessian  matrix  is  changed 
when  /  is  transformed  by  an  orthogonal  function  that  varies  with  x.  Thus,  with  exact 
arithmetic,  there  are  many  Gauss-Newton  methods  corresponding  to  a  given  vector 
function,  although  Newton's  method  remains  invariant  (see  also  Nocedal  and  Overton 
[1985],  p.  826).  In  fact,  each  step  of  a  Gauss-Newton  method  could  be  defined  by  a 
different  transformation  of  /.  Moreover,  the  conditioning  of  J  may  be  very  different 
from  that  of  J,  so  that,  for  example,  the  columns  of  J  might  be  strongly  independent, 
while  J  is  nearly  rank  deficient.  Since  the  number  of  rows  in  Q  may  be  greater  than  n, 
it  is  possible  to  imbed  the  given  nonlinear  least-squares  problem  in  a  larger  one. 

Although  it  is  known  that  Gauss- Newton  methods  do  not  work  well  under  all  cir¬ 
cumstances,  it  is  not  possible  to  say  anything  more  precise  about  the  method  when 
considering  large  and  varied  sets  of  test  problems.  Gauss-Newton  methods  are  of  practi¬ 
cal  interest  because  there  are  many  instances  in  which  they  work  very  well  in  comparison 
to  other  methods.  In  fact,  most  successful  specialized  approaches  to  nonlinear  least- 
squares  problems  are  based  to  some  extent  on  Gauss-Newton  methods  and  attempt  to 
exploit  this  behavior  whenever  possible.  However,  it  is  not  hard  to  find  cases  where 
Gauss-Newton  methods  perform  poorly,  so  that  they  cannot  be  successfully  applied  to 
general  nonlinear  least-squares  problems  without  modification. 

Fraley  [1987a,  1988b]  gives  numerical  results  for  a  large  set  of  test  problems  using 
widely-distributed  software  for  unconstrained  optimization  and  nonlinear  least  squares. 
She  also  includes  some  Gauss-Newton  methods  that  use  LSSOL  [Gill  et  al.  (1986a)]  to 
solve  the  linear  least-squares  subproblem  (2.2).  Her  findings  confirm  that  Gauss-Newton 
methods  are  often  among  the  best  available  techniques  for  nonlinear  least  squares  — 
especially  for  zero-residual  problems  —  but  that  there  are  many  cases  in  which  they 
fail  or  are  inefficient.  Detailed  examples  are  presented  in  Fraley  [1988b]  that  illustrate 
some  of  the  difficulties  involved  in  characterizing  those  problems  on  which  Gauss-Newton 
methods  will  or  will  not  work  well. 
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Many  attempts  have  been  made  to  define  algorithms  that  depart  from  the  Gauss- 
Newton  strategy  only  when  necessary.  Ramsin  and  Wedin  [1977]  use  the  steepest- 
descent  direction,  rather  than  the  Gauss-Newton  direction,  whenever  the  decrease  in 
the  objective  is  considered  unacceptably  small.  They  compare  the  performance  of  this 
Gauss-Newton-based  method  with  that  of  a  Levenberg-Marquardt  method  for  nonlinear 
least  squares  and  a  quasi-Newton  method  for  unconstrained  optimization,  both  from  the 
Harwell  Library.  The  quasi-Newton  routine  required  an  initial  estimate  Ho  of  the  Hessian 
matrix,  and  the  choice  Ho  =  J(xo)TJ(xo)  was  made  on  the  basis  of  preliminary  tests 
that  showed  equal  or  better  performance  compared  to  Ho  =  /.  The  test  problems  were 
constructed  so  that  asymptotic  properties  could  be  monitored  and  are  similar  to  those 
of  McKeown  [1975a,  1975b]  mentioned  above.  In  all  cases  considered,  the  Jacobian 
matrix  had  full  column  rank  at  the  solution.  The  experiments  involved  variation  of  a 
large  number  of  parameters.  Ramsin  and  Wedin  conclude  that  their  Gauss-Newton- 
based  method  and  the  Levenberg-Marquardt  method  are  identical  when  the  asymptotic 
convergence  factor  is  small,  but  that  neither  method  is  consistently  better  for  large 
asymptotic  convergence  factors.  Also,  they  find  that  in  instances  when  the  asymptotic 
convergence  factor  is  large,  the  quasi-Newton  method  may  be  more  efficient,  although 
superlinear  convergence  of  the  quasi-Newton  method  was  never  observed.  Ramsin  and 
Wedin  maintain  that  Gauss- Newton  should  not  be  used  when  (i)  the  current  iterate 
Xfc  is  close  to  the  solution  xm,  and  the  relative  decrease  in  the  size  of  the  gradient  is 
small,  (ii)  **  is  not  near  x',  and  the  decrease  in  the  sum  of  squares  relative  to  the 
size  of  the  gradient  is  small,  or  (Hi)  Jk  is  nearly  rank-deficient.  Conditions  (i)  and  (ii) 
are  indicators  of  inefficiency  for  any  minimization  algorithm.  Although  hybrid  methods 
do  exist  that  are  based  on  ascertaining  whether  or  not  the  current  iterate  is  close  to  a 
solution  (see  below),  a  drawback  of  these  approaches  is  that  they  rely  on  approximations 
to  asymptotic  relationships  and  are  not  sufficient  to  guarantee  proximity  to  a  minimum. 
Whether  the  parameters  defining  the  critical  conditions  can  be  chosen  in  such  a  way 
as  to  be  suitable  over  a  wide  range  of  problems  has  yet  to  be  demonstrated.  As  for 
condition  (it»),  rapidly  convergent  Gauss-Newton  methods  may  exist  even  if  nearly  rank- 
deficient  Jatobians  are  encountered,  but  it  is  appears  difficult  to  formulate  a  single  rule 
for  estimating  the  rank  of  the  Jacobian  that  is  satisfactory  for  all  such  problems  (see 
Fraley  [1988b]). 

Bard  [1970]  uses  the  eigenvalue  decomposition  of  J  to  solve  the  normal  equations 
(2.4).  In  order  to  ensure  a  positive-definite  system,  he  modifies  the  eigenvalues  if  their 
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magnitude  falls  below  a  certain  threshold.  In  addition,  his  implementations  include 
bounds  on  the  variables  that  are  enforced  by  adding  a  penalty  term  to  the  objective 
function.  He  compares  these  Gauss-Newton-based  methods  with  a  Levenberg-Marquardt 
method  (Section  3)  and  some  quasi-Newton  methods  for  unconstrained  optimization  on 
a  set  of  ten  test  problems  from  nonlinear  parameter  estimation.  He  finds  that  the 
Gauss-Newton-based  methods  are  more  efficient  in  terms  of  function  and  derivative 
evaluations  than  the  quasi-Newton  methods,  but  that  there  is  no  significant  difference 
in  the  relative  performance  of  the  Gauss-Newton-based  methods  and  the  Levenberg- 
Marquardt  method. 

Betts  [1976]  proposes  an  algorithm  that  combines  a  Gauss-Newton  method  with 
a  method  in  which  the  Gauss-Newton  approximate  Hessian  JTJ  is  augmented  by  a 
quasi-Newton  approximation  to  the  second-order  term  B  =  in  the  non¬ 

linear  least-squares  Hessian  (see  Section  5).  The  algorithm  starts  with  a  Gauss- Newton 
method,  and  then  switches  to  the  augmented  Hessian  when  it  is  believed  that  the  iterates 
are  near  the  solution.  The  criterion  for  the  switch  is 

llPkllj  <  «  (1  +  M2)>  (2-5) 

for  some  £  <  1.  Results  are  presented  for  the  hybrid  methods,  as  well  as  for  the 
underlying  Gauss- Newton  method  and  special  quasi- Newton  method  (see  Section  5), 
on  a  set  of  eleven  test  problems.  Betts  concludes  that  the  hybrid  method  is  superior, 
especially  on  problems  with  nonzero  residuals,  although  the  results  he  lists  in  his  tables 
do  not  all  have  the  same  value  of  £  in  (2.5).  Another  issue  that  is  not  clarified  is 
the  treatment  of  near-singularity  or  indefiniteness  in  the  quadratic  model  in  any  of  the 
methods  tested.  Also  the  test  (2.5)  is  not  sufficient  to  imply  that  the  Gauss-Newton 
iterates  are  in  the  vicinity  of  a  solution,  and  could  instead  indicate  inefficiency  in  the 
Gauss-Newton  method  at  some  arbitrary  point. 

Wedin  and  Lindstrom  [1988]  have  developed  an  algorithm  for  nonlinear  least-squares 
that  combines  a  Gauss- Newton  method  with  a  finite-difference  Newton  method.  A 
Gauss-Newton  search  direction  is  computed  at  every  iteration  using  a  QR  factorization 
with  column  pivoting  and  a  standard  rank-estimation  scheme.  The  ith  column  of  R  is 
replaced  by  a  column  of  zeros  in  the  factorization  if  the  i  diagonal  r„  satisfies 


|r<i|  <  oy/n |rn|, 
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where  a  is  a  fixed  tolerance.  However,  the  method  may  ultimately  decide  to  take  a  step 
along  a  Gauss- Newton  direction  for  which  the  effective  rank  is  less  than  the  original 
rank  estimate.  The  heuristics  for  determining  the  effective  rank  are  complicated,  and 
search  directions  for  several  different  values  of  the  effective  rank  may  be  tried  before 
a  step  is  actually  taken.  A  finite-difference  Newton  step  may  be  used  when  the  steps 
along  Gauss-Newton  directions  become  small,  and  the  iterates  are  judged  to  be  close  to 
a  solution.  In  the  algorithm,  the  decision  about  whether  z*  is  near  the  solution  is  based 
on  the  relation  ||/*||2  >  7 for  some  7  >  1,  where  /3*  is  the  norm  of  the  projection  of 
fk  onto  the  range  of  J* .  Note  that  depends  on  the  estimated  rank  of  the  Jacobian. 
The  ratio  flk/Pk- 1  is  used  as  an  estimate  of  an  asymptotic  linear  convergence  factor. 
They  give  numerical  results  for  a  set  of  thirty  large-residual  test  problems  constructed 
by  Al-Baali  and  Fletcher  [1985],  and  compare  their  results  with  those  given  by  Al-Baali 
and  Fletcher  for  two  hybrid  Gauss-Newton/BFGS  methods  and  a  version  of  NL2S0L 
(see  Dennis,  Gay,  and  Welsch  [1981  a,  bj).  Wedin  and  Lindstrom  find  that  their  method 
gives  better  overall  results  than  the  other  methods,  although  their  method  does  fail  in 
three  cases  due  to  a  finite-difference  Hessian  that  is  not  positive  definite. 

In  addition  to  those  described  above,  many  of  the  methods  discussed  in  subsequent 
sections  also  use  Gauss-Newton  search  directions  under  certain  circumstances. 

3.  Levenberg-Marquardt  Methods 

In  Levenberg-Marquardt  methods,  the  Gauss-Newton  quadratic  model  (2.1)  is  min¬ 
imized  subject  to  a  trust-region  constraint.  The  step  p  between  successive  iterates 
solves 

mm  gTp  +  |  pTJTJp  (3.1) 

subject  to  ||£>p||2  <  6, 

for  some  S  >  0  and  some  diagonal  scaling  matrix  D  with  positive  diagonal  entries. 
Equivalently,  p  minimizes  the  quadratic  model 

9TP  +  \  +  \DTD)p,  (3.2) 

for  some  A  >  0.  Since  the  matrix  JTJ  +  A DTD  is  positive  semidefinite,  minimizers  p\ 
of  (3.2)  satisfy  the  equations 

(JTJ  +  A  DrD)p  =-g  =  -JTf,  (3.3) 


10 


which  are  the  normal  equations  for  the  linear  least-squares  problem 

Pii».||(A)'-(o)ll  (34) 

Hence  a  regularization  method  (e.  g.f  Chapter  25  of  Lawson  and  Hanson  [1974],  Elden 
[1977,  1984],  Varah  [1979],  and  Gander  [1981])  is  being  used  to  solve  the  linear  least- 
squares  problem  (2.2)  for  the  step  to  the  next  iterate. 

The  paper  by  Levenberg  [1944]  is  the  earliest  known  reference  to  methods  of  this 
type.  Based  on  the  observation  that  the  unit  Gauss- Newton  step  paN  often  fails  to 
reduce  the  sum  of  squares  when  ||pow||  «  not  especially  small,  he  suggests  limiting  the 
size  of  the  search  direction  by  solving  a  "damped"  least-squares  subproblem, 

min  w(flTp  +  ~  pTJTJp)  +  H-Dpllj,  (3.5) 

in  which  a  weighted  sum  of  squares  of  linearized  residuals  and  components  of  the  search 
direction  is  minimized.  He  proves  the  existence  of  a  value  of  for  which 

ll/(*  +  Pw)||j  <  ll/OOHa, 

where  solves  (3.5),  thus  ensuring  a  reduction  in  the  sum  of  squares  for  a  suitable 
value  of  u.  A  major  drawback  is  that  no  automatic  procedure  is  given  for  obtaining  u. 
Levenberg  suggests  computing  the  value  of  \\f(x  +  pw)||2  for  several  trial  values  of  u>, 
locating  an  approximate  minimum  graphically,  and  then  repeating  this  procedure  with 
the  improved  estimates  until  a  satisfactory  value  of  u>  is  obtained,  but  precise  criteria 
for  accepting  a  trial  value  are  not  given.  Two  alternatives  are  proposed  for  the  diagonal 
scaling  matrix  D  in  (3.5) :  D  =  /,  because  it  minimizes  the  directional  derivative  gTpu 
for  u  ss  0,  and  the  square  root  of  the  diagonal  of  JTJ,  based  on  empirical  observations. 
The  claim  is  that  the  new  method  solves  a  wider  class  of  problems  than  methods  that 
existed  at  that  time,  and  that  it  does  so  with  relative  efficiency. 

Somewhat  later,  a  similar  method  was  (apparently  independently)  proposed.  Mor¬ 
rison  [1960]  considers  a  quadratic  model 

9rP+\pTHp,  (3.6) 

in  which  either  H  =  JT  J  or  H  =  V*  (T7 /)  (in  the  latter  case,  it  is  implicitly  as¬ 
sumed  that  V2  (/T/)  is  positive  semidefinite).  He  advocates  minimizing  (3.6)  over  a 
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neighborhood  of  the  current  point  as  does  Levenberg,  because  (3.6)  may  not  be  a  good 
approximation  to  ^  ^||/(x  +  p)||2  -  ||/(x)||2 )  if  the  minimizer  p*  is  large  in  magnitude, 
and  consequently  the  sum  of  squares  may  not  be  reduced  at  x  +  p*.  (In  Hartley  [1961], 
a  linesearch  is  used  with  the  Gauss-Newton  direction  for  the  same  reason.)  Morrison 
proves  that  the  solution  p\  to 

min  gTp  +  ^  pT(/f  +  \D)p 

for  A  >  0  is  the  constrained  minimum  of  (3.6)  on  the  sphere  of  radius  ||/?pa||2.  and 
that  IIpa||2  —  0  as  A  -»  oo.  In  Morrison’s  method,  the  step  bound  6  is  the  independent 
parameter,  rather  than  A.  No  specifications  are  given  for  either  6  or  D,  although  it  is 
implied  that  they  can  be  chosen  heuristically  for  a  given  problem.  Instead  of  minimizing 
(3.6)  subject  to  || Z?p|j2  <  6,  constraints  of  the  form  d{Xi  <  6  are  imposed,  and  the 
resulting  subproblem  is  then  solved  using  the  eigenvalue  decomposition  of  //.  Although 
the  theory  and  methods  apply  for  any  positive  semi-definite  H  in  (3.6),  no  generalization 
to  unconstrained  minimization  is  mentioned. 

Marquardt  [1963]  extended  Morrison's  work,  showing  that  the  vector  px  that  solves 
(3.3)  becomes  parallel  to  the  steepest-descent  direction  as  A  -*  oo,  so  that  p\  in¬ 
terpolates  between  the  Gauss-Newton  search  direction,  po,  and  the  steepest-descent 
direction,  poo.  He  points  out  that  the  method  determines  both  the  direction  from  the 
current  iterate  to  the  next  one,  and  the  distance  between  the  iterates  along  that  direc¬ 
tion,  and  that  increasing  A  decreases  the  step  length,  while  shifting  the  direction  away 
from  orthogonality  to  the  gradient  of  the  sum  of  squares.  Marquardt’s  strategy  con¬ 
trols  A  automatically  by  multiplying  or  dividing  the  current  value  by  a  constant  factor  v 
greater  than  1.  He  maintains  that  the  minimum  of  the  Gauss-Newton  model  should  be 
taken  over  the  largest  possible  neighborhood,  that  is,  that  A  should  be  chosen  as  small 
as  possible,  so  as  to  achieve  faster  convergence  by  biasing  the  search  direction  toward 
the  Gauss-Newton  direction  when  Gauss-Newton  methods  would  work  well.  Thus,  at 
the  fcth  iteration,  A*  =  Afc-i/t'  is  tried  first,  and  then  increased  if  necessary  by  mul¬ 
tiples  of  v  until  a  reduction  in  the  sum  of  squares  is  obtained.  A  shortcoming  of  this 
scheme  is  that  A  is  always  positive,  so  that  the  constraint  in  (3.1)  is  active  in  every 
subproblem,  and  consequently  a  full  Gauss- Newton  step  can  never  be  taken.  Also,  no 
efficient  method  is  given  for  solving  (3.3)  for  different  values  of  A.  Motivated  by  statis¬ 
tical  considerations,  Marquardt  uses  the  diagonal  of  JTJ  for  the  scaling  matrix  D  (one 
of  the  alternatives  proposed  by  Levenberg),  and  mentions  that  this  scaling  has  been 


widely  used  as  a  technique  for  computing  solutions  to  ill-conditioned  linear  least-squares 
problems. 

Since  the  appearance  of  Marquardt's  paper,  and  also  that  of  Goldfeld,  Quandt,  and 
Trotter  [1966],  which  independently  proposed  trust-region  methods  for  general  uncon¬ 
strained  optimization,  much  research  has  been  directed  toward  improvements  within  the 
framework  presented  there.  Bard  [1970]  takes  the  eigenvalue  decompostion  of  JTJ  at 
each  iteration,  so  that  (3.3)  can  be  easily  solved  for  several  values  of  A,  and  so  that 
it  will  be  known  whether  or  not  JT  J  is  singular.  Bartels,  Golub,  and  Saunders  [1970] 
show  how  to  use  the  SVD  of  J  instead  of  the  eigenvalue  decomposition  for  the  same 
purpose.  They  also  give  an  algorithm  for  computing  A  given  6  that  involves  determining 
some  eigenvalues  of  a  diagonal  matrix  after  a  symmetric  rank-one  update.  Meyer  [1970] 
discusses  the  use  of  a  linesearch  with  Marquardt’s  method  (see  also  Osborne  [1972]). 
Shanno  [1970]  selects  A  so  that  p\  is  a  descent  step  for  ||/(z)||2-  The  value  A  =  0 
is  tried  first,  and  then  increases  are  made  by  multiplying  a  threshold  value  by  a  factor 
greater  than  one  until  ^'(A)  <  0,  where  =  ||/(ar  +  Pa)!^-  In  addition,  a  linesearch 
is  also  used  when  cos(p\,g)  is  above  a  threshold  value,  that  is,  when  p\  is  judged  to 
be  nearly  in  the  direction  of  -g.  Shanno's  method  is  meant  for  general  unconstrained 
or  linearly-constrained  minimization,  as  well  as  for  nonlinear  least  squares. 

Several  methods  have  attempted  to  approximate  Levenberg-Marquardt  directions 
by  a  vector  that  is  the  sum  of  a  component  in  the  steepest  descent  direction,  and 
a  component  in  the  Gauss- Newton  direction  paN.  Jones  [1970]  combines  searches 
along  a  spiral  arc  connecting  pGN  and  the  origin  with  parabolic  interpolation  in  order 
to  obtain  a  decrease  in  the  sum  of  squares.  If  a  reduction  is  not  achieved  after  trying 
several  arcs,  then  the  steepest  descent  direction  is  searched.  The  method  of  Powell 
[1970a]  for  nonlinear  equations  and  [1970b]  for  unconstrained  optimization  searches 
along  a  piecewise  linear  curve.  The  algorithm  for  unconstrained  optimization  requires 
some  agreement  between  the  reduction  predicted  by  the  quadratic  model  and  the  actual 
reduction  in  the  sum  of  squares  before  the  step  is  accepted.  Global  convergence  results 
that  include  use  of  the  quadratic  model  (2.1)  for  nonlinear  least  squares  are  given  in 
Powell  [1975]  (see  also  Mori  [1983]).  Steen  and  Byrne  [1973]  approximate  a  search  along 
an  arc  that  intersects  g  at  a  nonzero  point.  Their  algorithm  requires  that  J  be  scaled 
so  that  its  smallest  eigenvalue  is  2,  which  they  accomplish  by  computing  ( JTJ)~ 1  and 
finding  either  ||(./T«/)“1||1  or  ||(/r«7)-1||oo.  A  diagonal  of  unspecified  small  magnitude 
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is  added  to  JTJ  in  the  event  of  singularity.  A  difficulty  with  any  algorithm  based  on 
this  type  of  approach  is  that  it  is  not  clear  how  to  define  the  approximation  when  the 
Gauss- Newton  direction  is  not  numerically  well  defined. 

Fletcher  [1971]  implements  a  modified  version  of  Marquardt’s  algorithm,  in  which 
adjustments  in  the  parameter  A  are  made  on  the  basis  of  a  comparison  of  the  actual 
reduction  in  the  sum  of  squares 

\  («/(*  + PA)ll3-[|/(*)|g).  (3-7) 

with  the  reduction 

gTp\  +  \pJjTJp\  (3-8) 

predicted  by  the  model  (3.2),  which  is  the  optimum  value  of  the  objective  in  (3.1)  (see 
also  Powell  [1970b]).  The  step  p *  is  taken  only  when  there  is  sufficient  agreement 
between  (3.7)  and  (3.8),  instead  of  accepting  p\  whenever  the  trial  step  results  in  a 
reduction  in  the  sum  of  squares.  Fletcher  also  introduces  more  complicated  techniques 
for  updating  A.  The  scheme  for  decreasing  A  differs  from  that  given  by  Marquardt  in 
that  division  by  a  constant  factor  is  used  only  until  A  reaches  a  threshold  value,  Ac, 
below  which  it  is  replaced  by  zero.  This  modification  is  motivated  by  a  desire  to  allow 
the  Gauss-Newton  step  (A  =  0)  when  Gauss-Newton  methods  would  work  well,  since 
A  is  always  positive  in  Marquardt's  method,  and  to  allow  the  initial  choice  of  A  =  0 
rather  than  some  arbitrary  positive  value.  Because  numerical  experiments  show  that 
multiplying  by  a  fixed  constant  factor  may  be  inefficient,  Fletcher  uses  safeguarded 
quadratic  interpolation  to  increase  A  when  (3.7)  and  (3.8)  differ  substantially.  If  the 
current  value  of  A  is  nonzero,  then  it  is  divided  by  a  factor 

{0.1,  if  amt„  <  0.1; 

®m«n>  if  <*m»n  6  [0.1, 0.5];  (3-9) 

0.5,  if  amin  >  0.5, 

where  amj„  is  the  minimum  of  the  quadratic  interpolant  to  the  function  4>{a)  = 
||/(x  +  ap)||j  at  ^(0),  $'(0),  and  <£(1).  There  is  also  a  provision  to  increase  A  =  0 
to  the  threshold  value  Ac  under  certain  circumstances.  The  choice  of  Ac  appears  to  be 
a  major  difficulty. 

Fletcher  gives  some  theoretical  justification  for  choosing  Ac  to  be  the  reciprocal 
of  the  smallest  eigenvalue  of  {JT J)~l .  Since  he  chooses  to  solve  (3.3)  directly  for 
each  value  of  A  via  the  Cholesky  factorization,  rather  than  compute  the  eigenvalue 
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decomposition  of  JTJ  or  the  singular  values  of  J,  the  minimum  eigenvalue  of  JTJ  is 
not  available  without  further  computation.  He  therefore  updates  the  estimate  of  Ac 
only  when  A  is  increased  from  0,  calculating  from  the  Cholesky  factorization 

of  JTJ,  and  then  takes  either  Ac  =  1/  ||(,/T.7)-1||oo,  or  Ac  =  1  /trace  ((7T./)-1).  A 
drawback  is  that  Ac  is  not  defined  when  JTJ  is  singular,  and  it  is  not  well  defined  when 
JTJ  is  ill-conditioned.  Harwell  subroutine  VA07A  is  an  implementation  of  Fletcher’s 
method.  It  allows  the  user  to  select  the  scaling  matrix  D,  which  then  remains  fixed 
throughout  the  computation.  The  default  for  the  scaling  matrix  is  the  square  root  of 
the  diagonal  of  JTJ  at  the  starting  value. 

An  efficient  and  stable  method  for  solving  (3.3)  for  several  values  of  A  based  on 
the  linear  least-squares  formulation  (3.4)  is  given  by  Osborne  [1972].  The  method  is 
accomplished  in  two  stages.  First,  the  QR  factorization  of  J  is  computed,  to  obtain 

(o  /)  (Ad)  =  {Ad)'  (31°) 

after  which  a  series  of  elementary  orthogonal  transformations  are  applied  to  reduce 
the  right-hand  side  of  (3.10)  to  triangular  form.  Thus  it  is  only  necessary  to  repeat 
the  second  stage  of  this  procedure  when  the  value  of  A  is  changed,  provided  the  QR 
factorization  of  J  is  saved.  In  a  later  paper,  Osborne  [1976]  discusses  a  variant  of 
Marquardt’s  algorithm  for  which  he  proves  global  convergence  to  a  stationary  point  of 
/T/  under  the  assumption  that  the  sequence  {A*}  remains  bounded.  In  this  method, 
he  uses  a  simple  scheme  similar  to  the  one  proposed  by  Marquardt  to  update  A,  but 
controls  adjustments  in  A  by  comparing  (3.7)  and  (3.8).  His  implementation  takes  D 
to  be  the  square  root  of  the  diagonal  of  JTJ,  as  in  Marquardt’s  method. 

The  algorithm  of  More  [1978]  adjusts  the  step  bound  6  in  (3.1)  rather  than  A,  a 
strategy  used  in  trust-region  methods  for  unconstrained  optimization  (see  More  [1983] 
for  a  survey).  Changes  in  6  depend  on  agreement  between  (3.7)  and  (3.8);  increases 
are  accomplished  by  taking  =  2j]Z?*jo*r|)2 ,  while  6  is  decreased  by  multiplying  by 
the  factor  7  defined  by  (3.9).  In  order  to  obtain  A  when  the  bound  in  (3.1)  is  active, 
the  nonlinear  equation 

•(A)  =  ||£>Pa||2  -  S  =  ||(JTy  +  A  DrD)~'  <7^  -6  =  0  (3.11) 

is  approximately  solved  by  truncating  a  safeguarded  Newton  method  based  on  the  work 
of  Hebden  [1973]  (see  also  Reinsch  [1971]).  Mord  reports  that,  on  the  average,  (3.11) 


is  solved  fewer  than  two  times  per  iteration.  Also,  he  proves  global  convergence  to 
a  stationary  point  of  fT f,  without  assuming  boundedness  for  {A*}.  Many  computa¬ 
tional  details  are  given,  including  an  efficient  method  for  calculating  the  derivative  of 
¥(A)  in  (3.11)  that  uses  the  QR  factorization  of  J.  A  modification  of  the  two-stage 
factorization  described  in  Osborne  [1972]  that  allows  column  pivoting  is  used  to  solve 
(3.3).  Subroutine  LMDER  in  MINPACK  [More,  Garbow,  and  Hillstrom  (1980)]  is  an 
implementation  of  the  method.  Variables  are  scaled  internally  in  LMDER  according  to 
the  following  scheme:  the  initial  scaling  matrix  Do  is  the  square  root  of  the  diagonal  of 
JT  J  evaluated  at  xq,  and  the  »th  diagonal  element  of  Dk  is  taken  to  be  the  maximum 
of  the  ith  diagonal  element  of  Dk-\  and  the  square  root  of  the  ith  diagonal  element 
of  JT  J.  Numerical  results  are  presented  indicating  that  this  scaling  compares  favorably 
with  those  used  by  Fletcher,  and  by  Marquardt  and  Osborne.  The  user  also  has  the 
option  of  providing  an  initial  diagonal  scaling  matrix  that  is  retained  throughout  the 
computation. 

Nazareth  [1980,  1983]  describes  a  hybrid  method  that  combines  a  Levenberg- 
Marquardt  method  with  a  quasi-Newton  approximation  Hk  to  the  full  Hessian.  The 
search  directions  solve  a  system  of  the  form 

(^kJj  J k  "F  (1  —  0k)Hk  +  A *77*  Dk)  p  —  —Qkt 

with  6 k  €  [0, 1]  and  A*  >  0.  He  compares  the  reduction  in  the  sum  of  squares  predicted 
by  both  the  levenberg-Marquardt  and  quasi-Newton  models  with  the  actual  reduction, 
and  then  chooses  0k  on  the  basis  of  this  comparison.  In  Nazareth  [1983],  a  simple 
version  of  the  hybrid  strategy  is  implemented  that  uses  Davidon’s  optimally  conditioned 
update,  with  Dk  =  /,  and  a  variation  of  Fletcher's  [1971]  method  for  updating  A. 
Results  are  reported  for  a  set  of  eleven  test  problems  —  including  five  problems  with 
nonzero  residuals  —  and  compared  to  the  use  of  the  algorithm  as  a  quasi-Newton  method 
(Ok  —  0)  or  a  Levenberg-Marquardt  method  (0k  =  1).  He  concludes  that  the  hybrid 
method  is  somewhat  better  for  the  problems  with  nonzero  residuals,  and  recommends 
development  of  a  more  sophisticated  implementation. 

4.  Corrected  Gauss-Newton  Methods 

Gill  and  Murray  [1976]  propose  a  finesearch  algorithm  that  divides  JJn  into  comple¬ 
mentary  subspaces  fl  and  A/1,  where  72  C  72(7T),  and  f[  is  nearly  orthogonal  to  7 Z(JT). 
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The  search  direction  is  the  sum  of  a  Gauss-Newton  direction  in  71,  and  a  projected  New¬ 
ton  direction  in  Sf.  This  strategy  avoids  a  shortcoming  of  Gauss-Newton  methods  — 
that  components  of  the  search  direction  that  are  nearly  orthogonal  to  R(Jr)  may  not 
be  well  determined  when  J  is  ill-conditioned  —  because  each  component  is  computed 
from  a  reasonably  well-conditioned  subproblem.  The  vector  x  -  x*  may  become  almost 
entirely  in  R(JT)  in  a  Gauss-Newton  method,  yet  the  algorithm  computes  a  search 
direction  that  is  virtually  orthogonal  to  7£(JT)  due  to  ill  conditioning  in  the  Jacobian 
(see  Fraley  [1987b]).  Gill  and  Murray  show  that  both  Gauss-Newton  algorithms  de¬ 
fined  by  (2.3)  and  Levenberg-Marquardt  algorithms  generate  search  directions  that  lie 
in  77(Jt),  while  the  Newton  search  direction  generally  will  have  a  component  in  M(  J), 
the  orthogonal  complement  of  72(JT),  whenever  J  has  linearly  dependent  columns.  For 
problems  with  small  residuals,  they  point  out  that  JT  J  is  a  reasonable  approximation 
to  the  full  Hessian  in  72(JT),  but  not  in  N(J).  Thus,  in  situations  where  x  -  x*  is 
orthogonal  to  7J(JT),  and  J  is  well-conditioned  but  has  linearly  dependent  columns  (for 
example,  when  m  <  n),  the  Gauss-Newton  and  Levenberg-Marquardt  directions  have 
no  component  in  the  direction  of  z  -  x*,  while  Newton's  method  and  also  the  method 
of  Gill  and  Murray  would  have  components  in  both  JT)  and  A f(J). 

The  basic  idea  of  the  method  is  as  follows.  Suppose  that 

J  =  QTVt  (4.1) 

is  an  orthogonal  factorization  of  J,  in  which  T  is  triangular  with  diagonal  elements  in 
decreasing  order  of  magnitude  (either  a  QR  factorization  with  column  pivoting  or  the 
singular-value  decomposition).  Let 


V  =  (Y  Z)  (4.2) 

be  a  partition  of  V  into  the  first  grade(J)  columns  and  the  remaining  n  —  grade(J) 
columns.  The  columns  of  Y  form  an  orthonormal  basis  for  7Z,  and  those  of  Z  form  an 
orthonormal  basis  for  N.  The  Newton  search  direction  for  the  nonlinear  least-squares 
problem  is  given  by 

(■ JTJ  +  B)p  =  -  JTf, 

with 

m 

t=i 
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or,  equivalently, 


VT(JTJ  +  B)p  =  -VTJTf,  (4.3) 

since  V  is  nonsingular.  Using  (4.2),  equation  (4.3)  can  be  split  into  two  equations: 

yT(JT  J  +  B)p  =  -YrJTf,  (4.4) 

and 

ZT(JTJ  +  B)p  =  -ZTJTf.  (4.5) 

Substituting  p  =  Y pY  4-  Zpz  into  (4.4)  yields 

YtJtJYPy  +  YTJTJZpz  +  YTBp  =  -YTJTf. 

Since  grade(J)  is  chosen  to  approximate  rank(J),  \\JZ\\  is  presumed  to  be  zero,  so 
that  VT  JT / Zpz  vanishes.  Also,  for  zero  residual  problems,  the  term  YTBp  would  be 
small  near  a  minimum  relative  to  YTJrJYpY,  since  [ji?|[  approaches  zero.  Defining  € 
to  be  ||x  —  i* ||,  where  z*  is  a  minimum  at  which  the  residuals  are  zero,  and  assuming 
||/||  =  O(c)  we  have 

YTJTJYpY  =  0(e);  YrBp  =  0(e2);  YTJTf  =  0(e). 

The  range-space  component  of  the  search  direction  is  therefore  chosen  to  satisfy 

YTJTJYpy  =  -YTJTf.  (4.6) 

With  grade(J)  =  rank(J),  the  vector  YpY  is  the  minimal  Z2-norm  least-squares  solution 
to  Jp  as  — /,  and  is  therefore  a  Gauss-Newton  direction.  For  the  null-space  portion, 
since  JZ  =  0  is  assumed,  (4.6)  reduces  to 

ZTBp  =  0, 

which  may  be  solved  for  Zpz  given  Y pY  from  (4.5)  using 

ZTBZpz  =  -ZtBYpy.  (4.7) 

When  exact  second  derivatives  are  not  available,  the  use  of  finite  difference  approxima¬ 
tions  along  the  columns  of  Z  is  suggested. 

A  version  of  this  algorithm  called  the  corrected  G&uss-Newton  method  [Gill  and 
Murray  (1978)]  forms  the  basis  for  the  nonlinear  least-squares  software  currently  in  the 
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NAG  Library  [1984].  It  us«s  the  singular-value  decomposition  of  J ,  rather  than  a  QR 
factorization.  Rules  based  on  the  relative  size  of  the  singular  values  are  given  for  choosing 
an  integer  grade(J)  to  approximate  rank(J),  and  an  attempt  is  made  to  group  together 
singular  values  that  are  similar  in  magnitude.  The  method  is  not  as  sensitive  to  grade(J) 
as  Gauss-Newton  is  to  rank  estimation,  both  because  of  the  division  of  the  computation 
of  the  search  direction  into  separate  components  in  fi  and  and  because  grade(J) 
is  varied  adaptively  based  on  a  measure  of  the  progress  of  the  minimization.  Moreover, 
the  rate  of  convergence  is  potentially  faster  than  Gauss-Newton  or  Levenberg-Marquardt 
methods  on  problems  with  nonzero  residuals.  The  quantity  grade(J)  is  reduced  when 
the  sum  of  squares  is  not  adequately  decreasing,  so  that  there  is  the  potential  of  having 
N  =  R"  (with  exact  second  derivatives,  this  implies  taking  full  Newton  steps)  in  the 
vicinity  of  a  solution.  The  derivation  below  shows  how  the  corrected  Gauss-Newton 
method  differs  from  the  earlier  version  based  on  the  QR  factorization. 

Because  of  (4.1),  JrJ  can  be  written  as  VTtTVt,  so  that  (4.3)  is  equivalent  to 

TtTVtp  +  V^Bp  =  -TtQt  f.  (4.8) 

Using  p  =  Y pY  +  Zpz,  along  with 

VtY  =  f^rad^jA  and  vTZ=(r  0  V 
V  ®  /  \  grade(J)  J 

(4.8)  becomes 

ttt  ^  IgraMJ)  ^  py  +  TrT  ^  ^ )  Pi  +  VT Bp  =  -TTQTf.  (4.9) 

If  we  let 


be  a  partition  of  T,  where  Tn  is  the  submatrix  consisting  of  the  first  k  rows  and  columns 
of  T,  then 

rpirp  _  ( {Ti\Tu  +  2$r„)  (TuT12  +  T2lT22)\ 

VPfrr,!  +  t^Th)  {Tj ^t12  +  t?2t22)J  ’ 

and  (4.9)  can  be  split  into  two  equations  : 

(T^Tn  +  T£T2l)pY  +  (TST12  +  TlTn)pz  +  YTBp  =  -  ( T&  7$  )  QTf ,  (4.10) 

and 

(T1t2Tu  +  T22T2l)pY  +  (T?2Tl2  +  7?2T22)pz  +  ZTBp  =  - (T12  T2t2)Qt/.  (4.11) 
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A*  in  the  earlier  version,  the  term  YTBp  is  ignored  in  (4.10).  Moreover,  in  the  case 
that  (4.1)  is  the  singular-value  decomposition,  both  t12  and  T2  i  vanish  and  the  two 
equations  can  be  further  simplified  to 

S*pY=-(Sl  0  )QTf,  (4.12) 

and 

Slpz  +  ZTBp  =  -  ( 0  S2)  QTf,  (4.13) 

where 

S\  =  Tu  and  S2  =  T22. 

Note  that  S2  and  S2  are  diagonal  matrices,  and  that  the  py  term  in  the  second  equa¬ 
tion  could  not  be  ignored  if  (4.1)  were  a  triangular  factorization  of  J,  because  then 
(T22Tl2  +  T22T2l)  could  not  be  assumed  negligible  relative  to  (T^TI2  +  Tj2T22).  The 
equations  that  are  ultimately  solved  are 

SiPy  =  ~(Igr*de(J)  0  )QT/»  (4-14) 

and 

(Si  +  ZTBZ)pt  =  -(0  S2)QTf-  ZTBpY.  (4.15) 

The  matrix  S2  +  ZTBZ  is  replaced  by  a  modified  Cholesky  factorization  if  it  is  compu¬ 
tationally  singular  or  indefinite.  The  range-space  component  is  a  Gauss- Newton  search 
direction,  while,  in  the  positive-definite  case,  the  null-space  component  is  a  projected 
Newton  direction. 

When  no  modification  is  necessary,  the  subproblem  being  solved  is 

min  gTp  +  \  pT(JTJ  +  B)p  (4.16) 

p€S  i 

subject  to  Jp  ~  -  /, 

where  is  taken  in  a  least-squares  sense  rf  the  rows  of  J  are  linearly  dependent,  as 
in  the  case  when  m  >  n,  and  otherwise  as  equality.  Subproblem  (4.16)  is  an  equality 
constrained  quadratic  program.  When  rank(J)  =  grade(J)  =  n,  its  solution  is  a  full- 
rank  Gauss-Newton  direction  that  is  completely  determined  by  the  constraints  in  (4.16). 
When  rank(J)  =  grade(J)  <  n,  the  search  direction  is  computed  as  the  sum  of  two 
mutually  orthogonal  components,  defined  by  equations  (4.14)  and  (4.15).  In  this  case 
S2  =  0,  so  that  the  projected  Hessian  in  (4.15)  is  ZrBZ  and  therefore  involves  only 
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the  second  derivatives  of  the  residuals.  We  shall  return  to  this  point  in  Section  7,  when 
we  discuss  SQP  methods  for  nonlinear  least  squares. 

Although  the  range-space  component  solving  (4.14)  can  never  be  a  direction  of  in¬ 
crease  for  /T/  (see  Fraley  [1987a]),  the  search  direction  computed  by  (4.14)  and  (4.15) 
may  not  be  a  descent  direction  for  /T/,  regardless  of  whether  or  not  Sj  +  ZT5Z  is 
modifed,  on  account  of  the  pY  term  in  (4.15).  Thus,  if  |cos(^,p)|  is  smaller  than  some 
prescribed  value,  or  if  grp  is  positive,  then  a  modified  Newton  search  direction  (corre¬ 
sponding  to  the  case  grade(J)  =  0)  is  used  instead.  A  finite-difference  approximation  to 
the  projected  matrix  ZTBZ  along  the  columns  of  Z,  and  a  quasi-Newton  approximation 
to  B  (see  the  discussion  in  Section  5)  are  given  as  alternatives  to  handle  cases  in  which 
second  derivatives  of  the  residual  functions  are  not  available  or  are  difficult  to  compute. 
Gill  and  Murray  test  their  method  on  a  set  of  twenty-three  problems,  and  find  that  when 
quasi- Newton  approximations  to  B  are  used,  the  algorithm  does  not  perform  as  well  as 
it  does  with  exact  second  derivatives  or  finite-difference  approximations  to  a  projection 
of  B.  They  observe  only  linear  convergence  for  the  quasi- Newton  version  on  problems 
with  large  residuals.  The  algorithms  are  implemented  in  the  NAG  Library  [1984]  as 
subroutine  E04HEF  which  uses  exact  second  derivatives,  and  subroutine  E04GBF  which 
is  the  quasi-Newton  version. 

5.  Special  Quasi-Newton  Methods 

Another  approach  to  the  nonlinear  least-squares  problem  is  a  based  on  a  quadratic 
model 

9TP  +  ^PT(JTJ  +  B)p, 

where  B  involves  quasi- Newton  approximations  to  the  term 

m 

B(x)  =  £>(*)VVi(*) 

1=1 

in  the  Hessian  of  the  nonlinear  least-squares  objective.  Brown  and  Dennis  [1971]  first 
proposed  a  method  in  which  the  Hessian  matrix  of  each  of  the  residuals  was  updated 
separately.  This  technique  is  impractical  because  it  entails  the  storage  of  m  symmetric 
matrices  of  order  n,  and  more  recent  research  has  aimed  to  approximate  B  as  a  sum. 

Dennis  [1973]  suggests  choosing  the  updates  to  satisfy  a  quasi-Newton  condition 

Bk+lsk  =  Vk  ~  l^k+l  (5.1) 
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where 


Sfc  =  xk+ 1  -  xk  and  yk  =  gk+i  -  gk- 


It  is  implied  that  the  update  can  then  be  chosen  as  in  the  unconstrained  case,  although 
there  is  some  ambiguity  as  to  how  this  should  be  done.  One  possibility  is  to  update  Bk 
directly  to  obtain  2?*+1,  subject  to  a  quasi-Newton  condition  such  as  (5.1)  on  Bk+iSk- 
Another  approach  consistent  with  Dennis'  description  is  to  modify  Hk  =  Jk+iJk+i  +  B*' 
requiring  the  updated  matrix  H  k+ 1  to  satisfy  a  quasi-Newton  condition 


Hk+\sk  =  J Ik-  (5.2) 

Then  Bk+i  —  H  k+  i  ~Jk+\Jk+i  i*  the  new  approximation  to  B  at  x*+i.  Depending  on 
the  update  and  quasi-Newton  conditions,  the  two  alternatives  may  not  yield  the  same 
result.  Moreover,  updates  defined  by  minimizing  the  change  in  the  inverse  of  J5*.  such 
as  the  BFGS  update  to  Bk,  make  no  sense  in  this  context,  since  the  matrix  B  would 
not,  by  itself,  be  expected  to  be  invertible. 

Betts  [1976]  implements  a  linesearch  method  in  which  the  symmetric  rank-one 
update  (see  Dennis  and  Mord  [1977])  is  applied  to  B,  with  the  quasi- Newton  condition 

Bk+i*k  —  Vk  ~  Jk  Jkak-  (5.3) 

This  scheme  is  equivalent  to  applying  the  symmetric  rank-one  formula  to  the  matrix 
Hk  —  ^k  +  Hk  with  the  updated  matrix  Bk+i  satisfying  (5.2),  and  then  taking 
Bk+i  —  H k+i  —  JjJk-  He  compares  this  algorithm  with  a  Gauss-Newton  method,  and 
also  with  a  hybrid  algorithm  that  starts  with  Gauss-Newton,  switching  to  the  augmented 
Hessian  Hk  when  the  iterates  are  judged  to  be  sufficiently  close  together  to  be  near  a 
solution.  It  is  not  clear  whether  the  update  is  performed  when  B  is  not  used  in  the  hybrid 
method.  Betts  reports  observing  quadratic  convergence  for  the  special  quasi-Newton 
methods.  For  further  discussion  of  these  results,  see  Section  2. 

Bartholomew- Biggs  [1977]  compares  the  PSB  update  (see  Dennis  and  More  [1977]) 
and  the  symmetric  rank-one  update  applied  directly  to  B  in  a  linesearch  method.  These 
updates  are  tested  with  the  quasi-Newton  condition  (5.1),  as  well  as  with  the  condition 

Bk+lSk  -  Jk+\fk+\  ~  Jkfk+U  (5.4) 
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which  is  derived  from  the  relation 


mm 

[v<£i(x*+l)  -  V^i(xfc)  +  0(||Sk||2)j 

i=l 

~  Jj+lfk+l  “  Jk  fk+ 1 


»=1 


(see  also  Dennis  [1976]).  Bartholomew- Biggs  points  out  that,  in  general,  quasi-Newton 
approximations  to  B  may  not  adequately  reflect  changes  that  are  due  to  the  contribution 
of  the  residuals.  For  example,  when  each  residual  function  <j>i  is  quadratic,  and  conse¬ 
quently  each  V2<j>i  is  constant,  Bk+ 1  may  differ  from  5jt  by  a  matrix  of  rank  n.  For  this 
reason,  he  does  some  experiments  with  updating  rBk  for  r  =  /J+1/*/ fj 5k,  which  is 
the  appropriate  scaling  for  the  special  case  in  which  fk+\  =  rfk  and  the  <f>i  are  quadratic. 
In  his  implementation,  a  Levenberg-Marquardt  step  is  used  whenever  the  linesearch  fails 
to  produce  an  acceptable  reduction  in  the  sum  of  squares  and  cos(g,p)  >  — 10~4. 
The  scaled  symmetric  rank-one  update  with  (5.4)  is  selected  to  compare  with  other 
methods  after  preliminary  tests,  because  it  exhibited  the  best  overall  performance,  and 
required  fewer  Levenberg-Marquardt  steps.  The  other  methods  tested  include  a  Gauss- 
Newton  method,  a  method  that  combines  Gauss-Newton  with  a  Levenberg-Marquardt 
method,  an  implementation  of  Fletcher's  [1971]  Levenberg-Marquardt  method,  and  a 
quasi-Newton  method  for  unconstrained  optimization.  All  of  the  fourteen  test  problems 
have  nonzero  residuals.  Bartholomew- Biggs  finds  that  the  special  quasi- Newton  method 
is  more  robust  than  the  other  specialized  methods  for  nonlinear  least-squares,  and  that 
it  is  particularly  suitable  for  problems  with  large  residuals.  He  also  observes  that  on 
problems  on  which  the  Gauss-Newton  and  Levenberg-Marquardt-based  methods  per¬ 
form  poorly,  the  special  quasi-Newton  method  is  more  effective  than  the  quasi-Newton 
method  for  general  unconstrained  optimization.  Nothing  is  said  about  the  observed  rate 
of  convergence  for  any  of  the  methods.  He  concludes  that  further  research  is  needed  to 
determine  the  best  updating  strategy,  some  desirable  features  being  hereditary  positive 
definiteness,  and  the  ability  to  update  a  factorization  of  B.  Finally,  he  indicates  that 
it  would  be  worthwhile  to  develop  a  hybrid  method  combining  Gauss-Newton  with  a 
special  quasi-Newton  method,  in  order  to  avoid  the  cost  of  the  updates  on  problems 
that  are  easily  solved  by  Gauss- Newton  methods. 


Gill  and  Murray  [1978]  discuss  a  linesearch  method  in  which  they  use  the  augmented 
Gauss- Newton  quadratic  model  only  to  compute  a  component  of  the  search  direction  in  a 
subspace  that  approximates  the  null  space  of  the  Jacobian  (see  the  preceding  section). 
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They  apply  the  BFGS  formula  for  unconstrained  optimization  (see  Dennis  and  More 
(1977])  to  the  matrix  Hk  =  Jj+tJk+x  +  Bk  with  the  quasi-Newton  condition  (5.2),  and 
then  form  Bk+ 1  =  ffk+i  -  Jk+iJk+i-  The  choice  of  the  BFGS  update  is  based  on 
performance  comparisons  to  a  number  of  other  updates,  including  the  symmetric  rank- 
one  update  and  Davidon's  optimally- conditioned  update  [Davidon  (1975)],  as  well  as  the 
symmetric  rank-one  update  applied  to  Hk  =  JkJk+Bk  used  in  Betts  [1976].  They  point 
out  that,  if  Jj+iJk+i  +  Bk  is  positive  definite,  and  yjsk  >  0,  then  Jk+1  Jfc+1  +  B/b+i  is 
also  positive  definite  with  this  scheme.  In  order  to  safeguard  the  method,  the  projected 
approximate  Hessian  is  replaced  by  a  modified  Cholesky  factorization  when  it  is  singular 
or  indefinite.  In  addition,  if  cos(p,g)  exceeds  a  fixed  threshold  value,  a  modified  Newton 
step  with  the  full  augmented  approximate  Hessian  is  taken.  See  Section  4  for  a  summary 
of  their  observations  on  the  performance  of  the  methods. 

Dennis,  Gay,  and  Welsch  [1981a]  apply  a  scaled  DFP  update  (see  Dennis  and  More 


[1977])  to  Bk  at  each  step.  The  new  approximation  Bk+\  solves 

min  || H~1/2(rkBk  -  B)H~^2\\F 

(5.5) 

subject  to 

Hsk  -  yk;  H  positive  definite 

(5.6) 

Bsk  =  jk+1fk+ 1  -  jkfk+ 1  ;  B  symmetric, 

(5.7) 

where 

Tk  =  min{ly?sk/sjBk*kl,l}- 

(5.8) 

The  scale  factor  r*  is  based  on  the  observation  that  the  quasi- Newton  approximation 
to  B  is  often  too  large  with  the  unsealed  update,  on  account  of  the  contribution  of 
the  residuals.  The  term  \yJsk/sjBksk\  in  rk  is  derived  from  the  self-scaling  principles 
for  quasi- Newton  methods  of  Oren  [1973],  and  attempts  to  shift  the  eigenvalues  of  the 
approximation  Bk  to  overlap  with  those  of  Bk,  using  new  curvature  information  at  x*. 
This  method  forms  the  basis  for  the  ACM  computer  program  NL2S0L  [Dennis,  Gay,  and 
Welsch  (1981b)],  which  is  distributed  by  the  PORT  Library  [1984]  as  subroutines  N2G 
and  DN2G.  It  is  implemented  as  an  adaptive  method,  in  that  Gauss- Newton  steps  are 
taken  if  the  Gauss- Newton  quadratic  model  predicts  the  reduction  in  the  function  better 
than  the  quadratic  model  that  includes  the  term  involving  B.  A  trust-region  strategy 
is  used  to  enforce  global  convergence.  Numerical  results  are  given  in  Dennis,  Gay,  and 
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Welsch  [1981a]  for  a  set  of  twenty-four  test  problems,  many  with  two  or  three  different 
starting  values. 

Al-Baali  and  Fletcher  [1985]  describe  some  linesearch  methods  that  are  similar  to 
the  method  of  Dennis,  Gay,  and  Welsch  [1981a]  discussed  above.  They  observe  that 
the  DFP  update  defined  by  (5.5)  -  (5.8)  is  equivalent  to  finding  Hk+i  to  *°lve 


subject  to 


min  Hff-^VZnA+i  +  +rkBk  -  H)H~XI2 \\F  (5.9) 

H\H 


Hsk  =  j/fc  ;  H  positive  definite  (5.10) 

Hsk  =  Jk+ifk+ 1  -  Jjfk+1  +  Jk+iJk+iak ;  &  symmetric, 


where 


rfc  =  Tmn{\yJsk/slBk8k\,l}, 


and  then  forming 


Bk+i  =  Hk+ 1  - 

Moreover,  they  use  the  condition 


Hsk  =  Jfki  H  positive  definite,  (5.11) 

with 

J Ik  =  Jk+lJk+lsk  +  Jk+\fk+l  -  Jkfk+l  =  yk  +  C>(||s*||2)  (5.12) 

as  an  alternative  to  (5.10),  and  mention  that  (5.10)  has  been  replaced  by  (5.11)  in 
newer  versions  of  NL2S0L.  The  claim  is  that  the  updated  matrix  is  almost  always  pos¬ 
itive  definite.  However,  if  the  matrix  Jk+iJk+\  +  rkBk  is  not  positive  semi-definite, 
rk  is  replaced  by  a  quantity  fk  that  is  calculated  by  a  method  similar  to  a  Rayleigh 
quotient  iteration,  so  that  jJ+17t+1  +  fkBk  is  positive  semi-definite  and  singular.  A 
corresponding  BFGS  method  is  also  given  in  which  the  update  is  defined  by 

min  ||/T1/2((.JfcVfc+1  +  TkBk)~l  - 

H\H 

instead  of  (5.9).  They  conclude  from  computational  tests  (described  in  Al-Baali  [1984]) 
that  their  method  is  somewhat  more  efficient  in  terms  of  the  number  of  Jacobian  eval¬ 
uations  than  NL2SGL,  but  requires  more  function  evaluations,  and  that  there  is  no  sig¬ 
nificant  difference  between  the  DFP  and  BFGS  updates.  Al-Baali  and  Fletcher  also 
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introduce  scaling  factors  based  on  finding  a  measure  of  the  error  in  the  inverse  Hessian. 
They  observe  that,  for  the  BFGS  update  for  unconstrained  optimization, 

II =  A k(Hk]yk), 

where  _  2 

Hence  an  "optimal”  value  of  r  can  be  found  by  minimizing  Jk+\  +  rBk)  as 

a  function  of  r.  Newton's  method  is  used  to  find  r,  an  iterative  process  that  requires 
factorization  of  Jj+i^k+i  +tBi,  for  each  intermediate  value  of  r.  They  were  apparently 
unable  to  draw  any  broad  conclusions  from  numerical  experiments  with  this  scaling,  and 
refer  to  Al-Baali  [1984]  for  details. 

A  convergence  analysis  for  minimization  algorithms  based  on  a  quadratic  model  in 
which  part  of  the  Hessian  is  computed  by  a  quasi-Newton  method  is  given  by  Dennis  and 
Walker  [1981]  (see  also  Chapter  11  of  Dennis  and  Schnabel  [1983]).  These  results  are 
restricted  to  methods  that  satisfy  a  least-change  condition  on  the  matrix  Bk  (analogous 
to  the  PSB  and  DFP  updates).  Only  a  fairly  mild  assumption  is  needed  to  prove 
superlinear  convergence  to  an  isolated  local  minimum  x * :  that  the  vector  y®  in  the 
quasi- Newton  condition 

BkSk  =  y* 

be  chosen  so  that  the  norm  of  the  update  is 

0(max{||x*  -  z*||'’,||xfc+1  -  x*||p}), 

for  some  p  >  0.  This  assumption  is  satisfied  for  y®  in  each  quasi-Newton  update 
to  Bk  described  above.  Their  treatment  of  inverse  updates  is  for  the  case  in  which 
part  of  the  inverse  Hessian  is  computed,  and  hence  does  not  apply  here.  To  the  best 
of  our  knowledge,  no  convergence  results  have  yet  been  proven  for  scaled  versions  of 
the  updates,  or  for  updates  to  Jj+i^k+i  +  Bk  that  are  not  equivalent  to  some  direct 
quasi- Newton  update  to  Bk- 

6.  Conjugate-Gradient  Acceleration  of  Gauss-Newton  Methods 

Ruhe  [1979]  uses  preconditioned  conjugate  gradients  to  speed  up  convergence  of 
Gauss-Newton  methods.  General  references  on  conjugate  gradients  include  Fletcher 
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[1980],  Chapter  4,  and  Gill,  Murray,  and  Wright  [1981],  Chapter  4.  We  give  a  brief 
explanation  below. 

The  linear  conjugate  gradient  method  minimizes  an  n-variate  quadratic  function 

<3(x)  =  qTp  +  ^pTIIp, 
in  at  most  n  iterations.  The  iteration  is 


where 


Pk  =  ~9k  +  fik-iPk-W 
xa+i  =  xk  +  akpk 


llfffc+i  lls 

IM} 


(6.1) 


9k  =  =  q  +  IIxk+ 1. 

The  method  produces  a  sequence  of  search  directions  that  are  // -conjugate,  that  is 

Pillpj  =  0  if  i^j. 

The  number  of  iterations  needed  to  minimize  Q  by  conjugate  gradients  (with  exact  arith¬ 
metic)  is  equal  to  the  number  of  distinct  eigenvalues  of  II.  The  idea  of  preconditioning 
is  to  transform  II  into  a  matrix  whose  eigenvalues  are  nearly  identical  in  magnitude.  If 
a  positive-definite  matrix  II  is  used  as  a  preconditioner,  then  convergence  occurs  in 
the  same  number  of  steps  that  would  be  taken  for  a  quadratic  function  with  the  Hessian 
matrix 

|(/-1/2//jr-,/2 

The  ideal  preconditioner  would  be  11'  =  //,  but  since  conjugate  gradients  are  competitive 
mainly  when  n  is  large,  an  approximation  that  is  relatively  inexpensive  to  factorize  is 
used.  For  a  smooth  nonlinear  function  ^(z),  the  conjugate  gradient  method  (6.1)  can 
also  be  applied,  with  gk  =  V /"(£*.)  and  ak  determined  by  a  linesearch,  with  safeguards 
to  ensure  descent.  There  are  several  possible  choices  for  (3k  that  are  equivalent  to  the  one 
given  above  for  the  quadratic  case  (see,  e.  g„  Fletcher  [1981],  Chapter  4).  The  method 
is  often  restarted  every  n  iterations  on  account  of  the  variation  in  V2T(x)  for  non¬ 
quadratic  functions  (e.  g„  Gill,  Murray,  and  Wright  [1981],  Chapter  4).  Preconditioners 
for  the  non-quadratic  case  attempt  to  approximate  V*T(x). 
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In  Ruhe’s  algorithm,  the  matrix  JTJ  is  used  as  the  preconditioner,  and  an  orthog¬ 
onal  factorization  of  J  is  used  to  compute  the  necessary  quantities.  The  method  is 
applied  to  problems  in  which  the  residuals  are  nonzero  and  the  Jacobian  has  full  rank, 
and  is  restarted  every  n  iterations.  He  concludes  that  the  preconditioned  conjugate- 
gradient  method  never  increases  the  total  number  of  iterations  required  to  solve  a  given 
problem  relative  to  Gauss-Newton,  and  that  significant  improvements  in  the  speed  of 
linear  convergence  of  Gauss-Newton  on  large-residual  problems  can  be  achieved  with 
conjugate-gradient  acceleration. 

Al-Baali  and  Fletcher  [1985]  point  out  that  conjugate-gradient  acceleration  of  the 
type  described  by  Ruhe  is  equivalent  to  applying  a  BFGS  update  to  the  Gauss-Newton 
approximate  Hessian  JTJ  at  each  step.  They  implement  and  test  both  this  method 
(without  restarts)  and  a  scaled  version,  where  the  scale  parameter  r  is  chosen  to  minimize 
Aic(Tj^Jk\yk)  as  a  function  of  r  (see  (5.13)).  They  give  no  conclusions  as  to  the 
relative  efficiency  of  the  scaled  and  unsealed  versions  of  the  method,  but  find  that  the 
modified  methods  offer  some  improvement  over  Gauss-Newton,  while  exhibiting  the 
same  difficulties. 

7.  Sequential  Quadratic  Programming  (SQP)  Methods 

Fraley  [1987a]  proposes  algorithms  that  solve  quadratic  programming  subproblems 
whose  formulation  is  based  on  convergence  properties  of  sequential  quadratic  program¬ 
ming  methods  for  constrained  optimization,  and  on  geometric  considerations  in  non¬ 
linear  least  squares.  The  motivation  behind  these  methods  is  as  follows.  Recall  that 
the  Hessian  matrix  of  the  least-squares  objective  can  be  separated  into  the  sum  of  two 
components  involving  different  types  of  derivative  information  : 

V2  =JtJ  +  B, 

where 

m 

*s£*\7**. 

<=i 

The  corrected  Gauss-Newton  methods  (Section  4)  calculate  a  search  direction  that  is 
separated  into  two  orthogonal  components  when  0  <  grade(J)  <  n,  and  can  be  viewed 
as  SQP  methods.  When  grade(J)  =  rank(J)  <  n,  the  contributions  of  JTJ  and  of 
B  (or  of  an  approximation  to  B)  are  essentially  decoupled  because  the  contribution  of 


JT  J  in  the  projected  Hessian  is  zero.  No  such  separation  is  possible  when  rank(J)  =  n. 
In  any  case,  grade(J)  <  n  may  be  selected  based  on  the  progress  of  the  minimization 
as  well  as  the  singular  values  of  J,  so  that  partial  separation  of  JTJ  and  B  may  occur 
between  the  extremes  of  Gauss-Newton  (grade(J)  =  rank(J)),  and  a  full  Newton-type 
method  (grade(J)  =  0).  The  strategy  of  making  a  quasi-Newton  approximation  to 
B  which  is  then  added  to  JTJ  in  a  full  Newton-type  method  has  not  been  successful 
outside  a  neighborhood  of  the  solution,  unless  it  is  combined  with  other  techniques  (see 
Section  5).  The  approach  taken  in  Fraley  [1987a]  is  to  use  a  quasi-Newton  approximation 
to  the  full  Hessian,  while  separating  out  some  of  the  contribution  to  the  curvature  due 
to  JTJ  by  including  first-order  information  about  the  residuals  as  constraints. 

A  search  direction  is  computed  as  the  solution  to  a  quadratic  program  (QP)  of  the 
form 

min  gTp  +  ~  pT  Hp  (7.1) 

subject  to 

—bL  <  Ap  +  c  <bv 

where 

bl  >  0  and  bv  >  0. 

In  SQP  methods  for  constrained  optimization,  H  approximates  the  Hessian  of  a  La- 
grangian  funtcion  in  order  to  take  into  account  the  curvature  of  the  constraints  that  are 
active  at  the  solution  (e.  g.,  Powell  [1983],  Gill  et  al.  [1985b,  1986b],  Nocedal  and  Over- 
ton  [1985],  Stoer  [1985],  and  Gurwitz  [1986]).  For  nonlinear  least  squares,  it  suffices  for 
H  to  approximate  the  Hessian  matrix  of  fTf  even  if  some  of  the  constraints  in  (7.1) 
are  active  at  a  solution  x*,  because  g(xm)  =  0.  These  methods  have  the  potential  to 
converge  faster  than  quasi-Newton  methods  for  unconstrained  optimization,  since  only 
the  projection  of  the  Hessian  in  the  null  space  of  the  active  QP  constraint  normals  — 
rather  than  the  full  Hessian  —  need  be  positive  definite  as  a  condition  for  superlinear 
convergence. 

Two  classes  of  suitable  QP  constraints  for  (7.1)  are  described  :  constraints  on  the 
directional  derivatives  of  individual  residuals,  and  constraints  based  the  QR  factorization 
of  J.  A  departure  from  other  algorithms  is  that  information  about  the  residuals,  and 
interrelationships  between  residuals,  can  be  used  to  construct  the  subproblems  (the 


algorithm  of  David  on  [1976]  is  an  exception  —  see  Section  11).  In  the  SQP  algorithms, 
a  set  C  of  desirable  constraints  is  chosen  first,  which  may  be  infeasible  or  may  otherwise 
exclude  aH  suitable  search  directions.  For  example,  such  a  set  of  constraints  is 


{V4>iP  =  fc ;  «  =  l,2,...,m}.  (7.2) 

Any  p  satisfying  V<f>Jp  =  ~4>i  is  a  descent  direction  for  fc  if  0  and  is  otherwise 
orthogonal  to  V<&.  The  unconstrained  minimum  pQW  of  the  QP  objective  in  (7.1)  is  a 
descent  direction  for  the  nonlinear  least-squares  objective  provided  H  is  positive  definite. 
Therefore,  as  long  as  pQN  is  considered  satisfactory,  an  acceptible  search  direction  will 
eventually  be  obtained  by  either  removing  some  constraints  from  C,  or  else  by  perturbing 
the  constraints  in  C  so  as  to  enlarge  the  feasible  region.  Based  on  this  reasoning,  she 
proposes  two  different  strategies  (which  could  also  be  combined). 

One  strategy  uses  a  QP  to  select  a  subset  of  constraints  in  C  as  the  feasible  region 
for  (7.1).  Several  quadratic  programs  may  be  solved  within  a  single  iteration  in  order 
to  compute  a  search  direction,  which  is  justified  for  two  reasons.  First,  starting  the 
solution  process  for  a  QP  with  information  about  the  solution  of  a  related  subproblem 
can  often  lead  to  significant  savings  in  QP  iterations  (see,  e.  g.,  Gill  et  al.  [1985a]). 
Also,  when  the  cost  of  a  function  evaluation  is  much  greater  than  the  cost  of  a  QP 
iteration,  the  effort  involved  in  obtaining  the  search  direction  by  solving  more  than  one 
subproblem  may  be  worthwhile  if  it  results  in  a  substantial  reduction  in  the  number  of 
outer  iterations. 

It  is  difficult  to  automate  the  selection  of  QP  constraints,  and  the  evaluation  of 
the  current  QP  solution  as  a  candidate  for  the  search  direction.  For  example,  each  of 
the  constraints  in  (7.2)  could  be  considered  separately  in  order  of  decreasing  residual 
size,  with  the  object  of  including  as  many  of  the  constraints  as  possible.  A  constraint  is 
added  to  the  current  constraint  set  (initially  empty)  if  the  corresponding  QP  computes 
an  ‘‘acceptable**  search  direction  p.  In  addition  to  the  requirement  that  gTp  <  0,  Fraley 
uses  a  lower  bound  on  the  magnitude  of  p,  and  an  upper  bound  on  cos(g,p),  as  the 
criteria  for  accepting  p.  Some  other  examples  that  use  constraints  based  on  the  QR 
factorization  are  very  similar  to  corrected  Gauss- Newton  methods  (Section  4). 

In  the  second  approach,  constraints  in  C  are  modified  in  order  to  obtain  a  suitable 
feasible  region.  This  is  accomplished  by  treating  constraint  bounds  as  variables  in  a  QP. 
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Using  the  constraint  set  (7.2),  Fraley  shows  how  these  SQP  algorithms  are  related  to 
Gauss- Newton  and  Levenberg-Marquardt  methods.  The  QP 

min  brb 

b-,p 

subject  to 


-b  <  Jp  +  f  <  b  (7.3) 

b>  0, 

computes  the  smallest  possible  perturbation  that  allows  all  of  the  (7.2)  to  intersect.  In 
the  solution  (b;  p)  to  (7.3),  the  vector  p  »s  a  Gauss-Newton  search  direction.  When  J 
is  ill-conditioned,  it  is  possible  that  the  constraints  in  (7.2)  do  intersect  (6  =  0),  but 
that  the  intersection  occurs  at  a  vector  p  that  is  very  large  in  magnitude.  For  u  >  0, 
the  QP 

min6T6  +  wpTp 

»;p 

subject  to 


~b  <  Jp+  f  <  b  (7.4) 

b>0, 

forces  ||6||  to  increase  when  ||p(|  would  otherwise  be  large.  In  the  solution  (b ;  p)  to  (7.4), 
the  vector  p  is  a  Levenberg-Marquardt  search  direction.  In  an  SQP  algorithm  based  on 
(7.3)  (respectively,  (7.4))  there  is  the  option  of  using  p  (p)  as  a  search  direction,  or  of 
using  b  ( b )  to  define  bounds  for  a  second  QP  of  the  form  (7.1),  from  which  the  search 
direction  is  computed. 

Fraley  proposes  a  number  of  variations  of  these  basic  SQP  algorithms  and  tests  some 
of  them  on  a  set  of  fourteen  problems.  She  uses  the  BFGS  method  to  approximate  H 
in  (7.1)  just  as  in  unconstrained  optimization,  and  observes  that  the  approximation 
retains  positive  definiteness  throughout.  She  finds  the  SQP  methods  work  well  on  some 
problems,  and  poorly  on  some  others,  so  that  it  is  not  possible  to  say  anything  conclusive 
about  their  performance  relative  ‘o  existing  methods. 

8.  Continuation  Methods 
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Continuation  methods  have  also  been  applied  to  nonlinear  least-squares  problems. 
These  methods  solve  a  sequence  of  parameterized  subproblems 


min$(z;r,);  i  = 


(8.1) 


where 

0  =  T0  <  n  <  . . .  <  rim„  =  1 

and 

arg  min  $(x;0)  =  xo  and  azg  min  $(*;  1)  =  x*. 

The  idea  is  that  methods  that  have  fast  local  convergence,  but  may  not  be  robust  in  a 
global  sense,  can  be  applied  to  solve  each  subproblem  in  relatively  few  steps,  because 
information  from  the  solution  of  previous  subproblems  may  be  used  to  predict  a  good 
r  starting  value  for  the  next  one. 

DeVilliers  and  Glasser  [1981]  define 

*(*;  T)  =  \  ll/OOIIz  +  \  (rk  -  1)11/(20)1(2  (8.2) 

where  k  is  a  positive  integer,  with  a  fixed  spacing  between  the  parameters  r<  in  (8.1). 
They  test  two  different  continuation  methods,  one  that  uses  Newton's  method  (with  line- 
search)  to  solve  the  intermediate  problems,  and  one  that  uses  a  Gauss-Newton  method 
(with  linesearch).  An  unspecified  "device”  is  included  in  the  implementation  of  both 
minimization  techniques  to  ensure  a  decrease  in  the  objective  at  every  iteration.  The 
continuation  methods  are  compared  with  results  obtained  by  applying  both  minimization 
algorithms  to  the  original  problem.  Intermediate  subproblems  are  not  solved  exactly  ;the 
criterion 

||Vx$(x;t<)[|2  <  Ci , 

where  e,  =  10-2  if  i  <  «m*x.  and  etniM  =  10-6,  is  used  to  determine  convergence  of  a 
subproblem. 

Numerical  experiments  are  carried  out  on  three  different  test  problems,  with  multiple 
starting  values,  most  of  which  are  points  of  failure  for  both  Newton's  method  and  Gauss- 
Newton.  They  conclude  that,  although  the  continuation  method  is  less  efficient  than 
the  underlying  method  when  both  are  successful,  it  will  converge  on  many  problems  for 
which  the  underlying  method  fails  when  used  alone.  However,  the  results  they  present 
are  for  different  values  of  the  step  size,  and  the  exponent  k,  and  no  mechanism  is  given 
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for  the  automatic  choice  of  either  of  the  parameters.  DeViliiers  and  Glasser  point  out 
that  their  methods  may  require  modification  if  the  optimization  method  that  is  used 
to  solve  the  subproblems  encounters  difficulties,  or  if  the  continuation  path  is  not  well- 
behaved.  Fraley  [1987a,  1988]  observes  that  the  first  two  test  problems  of  DeViliiers 
and  Glasser  are  very  sensitive  to  the  choice  of  the  maximum  step  bound,  or  the  initial 
trust-region  size  for  most  methods  and  that  the  methods  can  be  quite  efficient  provided 
an  appropriate  non-default  choice  is  made  for  these  parameters. 

Salane  [1987]  incorporates  a  trust-region  strategy  into  a  continuation  method  by 
defining 

*(*;r)  =  \  (||/(*)||,  +  (r  -  l)||/(*o)||?  +  A(r  -  l)||D(x  -  »0)|g)  ,  (8.3) 

and  then  applying  Gauss-Newton  to  this  function  for  the  inner  iterations.  Instead  of 
allowing  the  continuation  parameter  r  to  range  from  0  to  1,  he  advocates  stopping 
when  it  becomes  inefficient  to  solve  the  subprobiems,  and  then  restarting  the  method 
after  replacing  xQ  by  the  new  iterate.  He  points  out  that  his  approach  is  especially 
suitable  for  large-residual  problems,  because  it  transforms  the  original  problem  into  a 
sequence  of  subproblems  with  small  residuals.  The  idea  is  to  attempt  to  determine  when 
the  neglected  terms  become  significant,  and  then  pose  a  new  subproblem.  An  initial 
value,  Ti,  of  the  continuation  parameter  must  be  supplied  by  the  user  in  order  to  start 
the  method.  Should  any  step  fail  to  obtain  a  decrease  in  either  the  nonlinear  least- 
squares  objective  or  its  gradient,  T\  is  decreased,  and  the  calculation  is  repeated  without 
changing  x0-  Theorems  on  descent  conditions  and  convergence  are  presented.  Salane 
argues  that  his  continuation  method  allows  direct  selection  of  the  Levenberg-Marquardt 
parameter  A  in  (8.3),  because  A  may  be  chosen  so  that  the  term  A(1  -  t)DtD  behaves 
somewhat  like  the  second-order  terms  that  have  been  neglected  in  the  Hessian  of  $(z;  r). 
However,  no  mechanism  is  suggested  for  automatic  choice  of  A,  and  A  =  ||/(xo)||2  »s 
used  in  the  tests. 

Salane  gives  test  results  for  a  version  of  his  algorithm  on  ?  set  of  nine  problems 
(all  of  which  are  included  in  our  set).  A  comparison  is  made  to  results  obtained  from 
MINPACK,  and  also  to  the  results  reported  by  DeViliiers  and  Glasser  [1981]  for  two  of 
the  test  problems.  He  concludes  that  the  performance  of  the  method  compares  favorably 
with  that  of  MINPACK,  and  is  superior  to  the  DeViliiers  and  Glasser  continuation  method 
on  the  relevant  problems.  The  matrix  D  in  (8.3)  is  taken  to  be  the  identity  matrix 
throughout  the  tests,  and  for  one  test  problem  a  type  of  variable  scaling  is  used.  No 
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information  is  given  concerning  scaling  for  the  MINPACK  tests.  The  results  that  are 
presented  correspond  to  several  different  values  of  tj,  although  the  criterion  used  in 
choosing  this  value  is  not  given.  Test  results  in  which  the  value  of  rt  is  varied  are  included 
for  three  of  the  problems  for  the  purpose  of  showing  that  performance  is  sensitive  to  the 
specification  of  the  continuation  parameter. 

9.  Modifications  of  Unconstrained  Optimization  Methods 

Besides  Gauss- Newton  methods,  several  straightforward  modifications  of  uncon¬ 
strained  optimization  methods  are  possible  for  nonlinear  least  squares.  In  quasi-Newton 
methods,  JjJ0  can  be  used  as  the  initial  approximation  to  the  Hessian  matrix.  Ramsin 
and  Wedin  [1977]  report  favorable  results  with  this  technique.  We  note  that  a  perturbed 
matrix  Jj  j0  can  be  used  as  the  initial  approximate  Hessian,  where  J0  is  a  modified 
Cholesky  factor  of  JqJ0  (Gill  and  Murray  [1974])  ,  in  order  to  maintain  positive  defi¬ 
niteness  when  Jo  is  ill-conditioned. 

Wedin  [1974]  (see  also  Ramsin  and  Wedin  [1977])  suggests  a  modification  of  New¬ 
ton's  method  in  which  the  search  direction  is  defined  by 

(JT  J  +  £<fcV2<MP  =  ~9,  (9.1) 

t=i 

where  fa  is  the  *th  component  of  the  projection  /  of  /  onto  TZ(J).  This  iteration 
approaches  Newton's  method  in  the  limit,  since  /(x*)  =  /(x*),  and  is  parameter- 
independent,  in  the  sense  that  minimization  of  /  as  a  function  of  x  is  equivalent  to 
minimization  of  /  as  a  function  of  a  new  variable  z  —  provided  the  mapping  that 
defines  x  as  a  function  of  z  has  a  nonsingular  Jacobian.  An  obvious  difficulty  is  that  /, 
and  hence  (9.1),  is  not  well-defined  when  J  is  ill-conditioned. 

Recall  that  in  quasi- Newton  methods  for  unconstrained  optimization,  the  approxi¬ 
mate  Hesian  matrix  is  required  to  satisfy  the  condition 

HkSk  =  9fe>  (9.2) 

where 

=  Xfc+ 1  -  ifc  and  y*  =  gk+i  -  gk 

(e.  g.  Dennis  and  Mor4  [1977]).  Al-Baali  and  Fletcher  [1985]  suggest  the  use  of  y* 
defined  by  (5.12)  rather  than  y*  =  g>-+i  —  gk  in  the  quasi-Newton  condition  (9.2). 


They  report  improvements  with  the  BFGS  and  DFP  formulas  when  this  substitution 
is  made.  However,  they  remark  that  the  condition  yjsk  >  0  for  hereditary  positive 
definiteness  of  the  updates  is  not  guaranteed  by  the  linesearch  requirements,  and  they 
replace  yja*  in  the  update  formulas  by  max  {yja*,0.01yjs*}  as  a  safeguard.  They  do 
not  consider  this  a  major  drawback,  because  yjsk  >  0.01yjs/t  almost  always  occurred 
in  their  examples.  A  somewhat  different  safeguard  is  used  in  a  later  related  paoer  [see 
Fletcher  and  Xu  (1986),  p.  26]  discussed  below. 

Al-Baali  and  Fletcher  [1985]  also  develop  several  hybrid  linesearch  methods  in  which 
the  models  are  assessed  in  terms  of  the  function  A  *  defined  by  (5.13),  an  approximate 
measure  of  the  error  in  the  inverse  Hessian.  In  one  class  of  methods,  the  modified  BFGS 
update  described  in  the  preceding  paragraph  is  applied  to  a  matrix  of  the  form 

Hk+ 1  =  (1  -  8k)Hk  +0fcT-k«Jfc+1«Jfc+l> 

where  t*  minimizes  Afc(rfc7j+1  y*),  and  0*  is  chosen  to  minimize  Ak{Hk+i',yk), 
in  order  to  obtain  the  new  approximate  Hessian.  In  their  implementation,  in  which  Ok 
is  restricted  to  be  either  0  or  1,  they  find  that  the  method  has  difficulties  on  singular 
problems,  and  that  the  scaling  of  the  search  direction  often  does  not  alfow  a  =  1  as  a 
trial  step  in  the  linesearch.  They  refer  to  Al-Baali  [1984]  for  more  details  of  the  tests. 

Another  class  of  hybrid  methods  defined  by  Al-Baali  and  Fletcher  compares  the 
value 

Aon  =  A  k(Hk',yk) 

for  the  current  quasi-Newton  approximation  Hk  with 

Acn  =  Ak(Jk+lJk+l'ii/k) 

for  the  Gauss-Newton  approximation.  The  basic  algorithm  can  be  summarized  as  follows 

if  A<3*  <  A  on  then  use  the  modified  BFGS  search  direction 

,  (9.3) 

else  use  the  Gauss-Newton  search  direction 

They  test  several  versions  of  this  method  that  differ  in  the  action  taken  whenever  a 

switch  from  Gauss-Newton  to  quasi-Newton  takes  place.  In  one,  Hk+i  is  reset  to 

jJ+iJk+i'  while  in  another  Hk+i  is  reset  to  the  result  of  applying  the  modified  BFGS 

update  to  (conjugate-gradient  acceleration).  They  observe  little  difference  in 

performance  between  these  two  alternatives,  and  find  them  to  be  the  best  of  the  many 


methods  for  nonlinear  least  squares  treated  in  their  study.  A  version  of  the  first  strategy 
that  substitutes  the  quantity  minr  for  AON  in  the  comparison  with 

Aqh  is  also  tried,  but  it  is  found  to  have  some  difficulties  on  a  problem  for  which  the 
Jacobian  is  singular  at  the  solution.  A  final  variant  maintains  the  quasi- Newton  update 
throughout,  and  never  resets  the  approximate  Hessian.  They  find  that  this  method  is 
not  as  efficient  as  the  others  on  some  types  of  large- residual  problem. 

Fletcher  and  Xu  [1986]  give  an  example  in  which  the  hybrid  method  (9.3)  has  a 
linear  rate  of  convergence  when  the  BFGS  method  would  converge  superlinearly.  The 
difficulty  is  that  the  comparison  between  AQS  and  AON  may  fail  to  distinguish  between 
zero-residual  problems  and  those  with  nonzero  residuals.  They  propose  two  new  hybrid 
algorithms  and  show  them  to  be  superlinearly  convergent.  The  first  algorithm  computes 
the  modified  BFGS  search  direction  if 


IlfeMli  -  !Mk  <  (9.4) 

for  some  fixed  a  £  (0, 1),  and  a  Gauss- Newton  step  otherwise.  The  method  is  motivated 
by  the  following  relationship 

Il/(«*)ll,  -  ll/(«*+i)l|a  f  0,  if  ||/(OII2  *  0; 

||/(**)lla  ~U  if||/(**)lla  =  0. 

The  second  algorithm  computes  a  modified  BFGS  step  if 


H/(«fc)-/(«*+i)lla 

ll/(**)lla 


and 


Ak(Jk+iJk+yyk)  > 

Afc(7j  pk) 


(9.5) 


where  both  a  and  7  are  fixed  parameters  in  (0,1),  and  a  Gauss-Newton  step  otherwise. 
The  additional  condition  for  choosing  the  BFGS  search  direction  is  derived  from  another 
asymptotic  relationship 


..  Ak(Jk+i^k+yyk)  ( 0,  if  11/(011,  =  0; 

Ak(JjJk-,yk)  U  if||/(**)lla/o. 

Numerical  results  are  given  for  a  set  of  fifty-six  test  problems,  a  few  with  multiple  starting 
values.  They  conclude  that  the  new  methods  offer  some  overall  improvement  over  those 
based  on  (9.3),  but  that  there  is  no  reason  to  prefer  the  more  complicated  test  (9.5) 
over  (9.4). 


10.  Special  Linesearch  Procedures 


Lindstrom  and  Wedin  (1984]  and  Al-Baali  and  Fletcher  [1986]  propose  specialized 
linesearch  methods  for  nonlinear  least-squares  problems  in  which  each  residual  is  inter¬ 
polated  by  a  quadratic  function,  in  contrast  to  the  strategy  of  interpolating  to  the  sum 
of  squares  used  in  conventional  linesearches  for  unconstrained  minimization.  As  a  result 
a  quartic  polynomial,  rather  than  a  simpler  cubic  or  quadratic,  is  minimized  at  each 
iteration  of  the  linesearch. 

Lindstrom  and  Wedin  substitute  their  linesearch,  which  uses  only  function  values, 
for  the  quadratic  interpolation  and  cubic  interpolation  routines  in  the  NAG  Library 
(1980  version)  nonlinear  least-squares  algorithm  E04GBF  (see  Sections  4  and  5),  and 
compare  the  performance  with  the  NAG  linesearch  routines  on  a  set  of  eighteen  test 
problems.  They  find  that  no  linesearch  algorithm  is  superior  over  all,  but  that  their 
algorithm  makes  a  better  initial  prediction  to  the  steplength  that  minimizes  the  sum  of 
squares  along  the  search  direction.  In  a  second  set  of  tests  that  includes  multiple  starting 
values  for  many  of  the  test  problems,  they  add  a  modified  version  of  their  linesearch 
algorithm  that  reverts  to  a  simple  backtracking  strategy  if  an  acceptable  decrease  in  the 
sum  of  squares  is  not  obtained  after  two  function  evaluations.  They  observe  that  their 
modified  method  requires  fewer  function  evaluations  than  either  of  the  NAG  linesearch 
routines,  and  that  the  total  for  their  original  method  falls  between  cubic  interpolation 
and  quadratic  interpolation  to  the  sum  of  squares.  They  note  occasional  inefficiencies  in 
their  methods  due  to  extrapolation,  but  comment  that  such  effects  are  more  pronounced 
for  quadratic  interpolation  of  the  sum  of  squares. 

Al-Baali  and  Fletcher  [1986]  test  similar  linesearch  methods  that  use  gradients  on 
a  set  of  fifty-five  test  problems  with  a  number  of  nonlinear  least-squares  algorithms 
described  in  Al-Baali  [1984]  (see  also  Al-Baali  and  Fletcher  [1985]).  They  conclude  that 
considerable  overall  savings  can  be  made  by  interpolating  to  each  of  the  residuals  rather 
to  than  the  sum  of  squares.  They  also  obtain  favorable  results  for  two  different  schemes 
designed  to  save  Jacobian  evaluations  in  the  new  linesearch. 

11.  Methods  for  Special  Problem  Classes 

Algorithms  have  also  been  formulated  to  treat  some  special  cases  of  the  nonlinear 
least-squares  problem.  For  example,  there  is  a  vast  literature  concerning  methods  specific 
to  nonlinear  equations  that  we  shall  make  no  attempt  to  survey  here. 


In  some  nonlinear  least-squares  problems,  the  vector  x  can  be  separated  into  two 
sets  of  variables,  say 

»G) 

where  it  is  relatively  easy  to  minimize  the  sum  of  squares  as  a  function  of  y  alone.  A 
fairly  common  situation  of  this  type  is  one  in  which  y  is  the  set  of  variables  that  occur 
linearly  in  all  of  the  residuals,  so  that 

tIKOIE 


is  a  linear  least-squares  problem.  For  example,  exponential  fitting  problems  (see  Varah 
[1985])  fall  into  this  category.  Methods  that  deal  with  separable  nonlinear  least-squares 
problems  were  introduced  by  Golub  and  Pereyra  [1973].  Ruhe  and  Wedin  [1980]  survey 
these  methods  and  give  some  extensions.  They  describe  three  basic  algorithms,  all 
of  which  use  Gauss-Newton  to  minimize  the  sum  of  squares  as  a  function  of  y.  The 
methods  differ  in  the  definition  of  the  quadratic  model  function  for  minimization  with 
respect  to  z.  The  Jacobian  and  Hessian  of  the  nonlinear  least-squares  objective  can  be 
partitioned  as  follows : 


J  =  (Jy  J .) 


oz) 


S+(5 


’ vv 


BZJ' 


so  that 


and 


=G„-Gl,G^G,v 

=  (jJj,  +  By.)  -  (Jjjv  +  +  Byy)-\jJjy  +  Bty). 

The  approximate  Hessians  that  are  considered  for  the  minimization  as  a  function  of  z 
are 

j?j,-Giy{j?jyriGty,  (n.i) 

jJj,  -  JyJ'VyJyr'jJjy,  (11-2) 
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yfn-*,'- 


f 


and 

Jjj,.  (11.3) 

Algorithms  based  on  (11.1)  and  (11.2)  are  shown  to  converge  at  a  faster  rate  than  the 
conventional  Gauss-Newton  method,  while  the  asymptotic  convergence  rate  for  (11.3) 
may  be  much  slower.  On  the  other  hand,  of  the  three  quadratic  models,  it  is  least 
expensive  to  compute  solutions  with  the  approximate  Hessian  (11.3),  and  most  expensive 
to  compute  them  from  (11.1).  Use  of  (11.2)  costs  about  the  same  as  a  conventional 
;  Gauss- Newton  method.  Tests  on  four  sample  problems  are  given  to  illustrate  rates  of 

convergence. 

Davidon  [1976]  introduces  a  quasi-Newton  method  for  problems  in  which  (t)  m  > 
n,  (ii)  location  of  the  minimum  is  not  very  sensitive  to  weighting  of  the  residuals,  and 
F  (Hi)  rapid  approach  to  a  minimum  is  more  important  than  convergence  to  it.  A  new 

estimate  of  the  minimum  is  computed  after  each  individual  residual  and  its  gradient  are 
evaluated,  rather  than  after  evaluating  the  entire  block  of  m  residuals.  Davidon  gives 
an  analogy  to  time-dependent  measurements  of  experimental  data,  in  which  quantities 
calculated  from  the  measurements  are  updated  each  time  a  new  observation  is  made. 
Starting  from  an  initial  quadratic  approximation 

qo(x)  =  f(x0)Tf(x0)  +  (x  -  xq)tHq1(x  -  x0), 

with  Ho  positive-definite,  the  algorithm  that  determines  the  next  iterate  is  equivalent 
to  minimizing  a  quadratic  function  of  the  form 

qk+i(x)  =  [<t>j(xk)  +  (*  -  z*)TV0j(xfc)]2  +  A kqk(x), 

where  A*  is  in  (0,  l!.  It  is  suggested  that  the  choice  of  {A*}  should  be  problem- 
dependent,  and  some  alternatives  are  proposed.  Davidon  tests  the  method  on  a  set 
of  four  problems  in  which  he  varies  the  size  of  the  problem,  the  initial  estimate  of  the 
solution,  and  the  sequence  {A*}.  He  observes  that  the  method  tends  to  oscillate  about 
a  minimum  rather  than  converging  to  it,  but  that  it  often  redu:es  the  sum  of  squares 
more  rapidly  than  other  methods. 

Further  computational  experiments  with  Davidon ’s  method  are  reported  in  Cornwell, 
Kocman,  and  Prosser  [1980].  On  a  set  of  fifteen  zero-residual  problems,  they  test  the 
k  method  with  various  fixed  values  of  A*.  They  obtain  overflow  in  most  cases  for  small 

values,  but  otherwise  find  that  the  efficiency  of  the  method  decreases  as  A*  is  increased. 
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In  one  case,  the  method  cycled  through  a  sequence  of  points  that  was  not  near-optimal. 

On  the  basis  of  these  observations,  they  implement  a  new  version  that  attempts  to  use 
a  fixed,  relatively  small  value  of  A*,  restarting  from  the  initial  vector  with  a  larger  value 
if  it  is  determined  that  overflow  would  otherwise  occur.  They  find  that  this  modified 
implementation  of  Davidon's  method  is  competitive  with  the  computer  program  LMCHOL 
from  Argonne  National  Laboratory  based  on  Fletcher's  [1971]  Levenberg-Marquardt 
algorithm  (which  has  since  been  superseded  by  the  MINPACK  routine  LMDER  [Mori, 

Garbow,  and  Hillstrom  (1980)]). 
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